Mercurial: mercurial/help/internals/linelog.txt comparison

comparison mercurial/help/internals/linelog.txt @ 38795:422d661056be

linelog: add a Python implementation of the linelog datastructure

This datastructure was originally developed by Jun Wu at Facebook, inspired by SCCS weaves. It's useful as a cache for blame information, but also is the magic that makes `hg absorb` easy to implement.

In service of importing the code to Mercurial, I wanted to actually /understand/ it, and once I did I decided to take a run at implementing it.

The help/internals/linelog.txt document is the README from Jun Wu's implementation. It all applies to our linelog implementation.

Differential Revision: https://phab.mercurial-scm.org/D3990

author	Augie Fackler <augie@google.com>
date	Mon, 30 Jul 2018 10:42:37 -0400
parents
children	c10be3fc200b


-:1d01cf0416a5
+:422d661056be
+linelog is a storage format inspired by the "Interleaved deltas" idea. See
+https://en.wikipedia.org/wiki/Interleaved_deltas for its introduction.
+0. SCCS Weave
+To understand what linelog is, first we have a quick look at a simplified
+(with header removed) SCCS weave format, which is an implementation of the
+"Interleaved deltas" idea.
+0.1 Basic SCCS Weave File Format
+A SCCS weave file consists of plain text lines. Each line is either a
+special instruction starting with "^A" or part of the content of the real
+file the weave tracks. There are 3 important operations, where REV denotes
+the revision number:
+^AI REV, marking the beginning of an insertion block introduced by REV
+^AD REV, marking the beginning of a deletion block introduced by REV
+^AE REV, marking the end of the block started by "^AI REV" or "^AD REV"
+Note on revision numbers: For any two different revision numbers, one must
+be an ancestor of the other to make them comparable. This enforces linear
+history. Besides, the comparison functions (">=", "<") should be efficient.
+This means, if revisions are strings like those of git or hg, an external
+map is required to convert them into integers.
+For example, to represent the following changes:
+REV 1 | REV 2 | REV 3
+------+-------+-------
+a     | a     | a
+b     | b     | 2
+c     | 1     | c
+      | 2     |
+      | c     |
+A possible weave file looks like:
+^AI 1
+a
+^AD 3
+b
+^AI 2
+1
+^AE 3
+2
+^AE 2
+c
+^AE 1
+An "^AE" does not always match its nearest operation ("^AI" or "^AD"). In
+the above example, "^AE 3" does not match the nearest "^AI 2" but "^AD 3".
+Therefore we need some extra information for "^AE". The SCCS weave uses a
+revision number. It could also be a boolean value about whether it is an
+insertion or a deletion (see section 0.4).
+0.2 Checkout
+The "checkout" operation is to retrieve file content at a given revision,
+say X. It's doable by going through the file line by line and:
+- If we meet "^AI rev" and rev > X, find the corresponding "^AE" and jump
+there
+- If we meet "^AD rev" and rev <= X, find the corresponding "^AE" and jump
+there
+- Ignore ^AE
+- For normal lines, just output them
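The checkout rules above can be sketched in a few lines of Python. This is only an illustration, not the SCCS or Mercurial code: it spells the control character as the literal two characters "^A", assumes no content line starts with "^A", and assumes the first following "^AE rev" is the one that closes a block. The names `WEAVE`, `find_end` and `checkout` are made up for this sketch.

```python
# The example weave from section 0.1, one entry per weave line.
WEAVE = """\
^AI 1
a
^AD 3
b
^AI 2
1
^AE 3
2
^AE 2
c
^AE 1
""".splitlines()


def find_end(weave, start, rev):
    """Return the index of the "^AE rev" closing the block opened at start.

    This is the O(N) scan that section 0.7 complains about.
    """
    for i in range(start + 1, len(weave)):
        if weave[i] == "^AE %d" % rev:
            return i
    raise ValueError("unterminated block for rev %d" % rev)


def checkout(weave, x):
    """Return the content lines of revision x."""
    out = []
    i = 0
    while i < len(weave):
        line = weave[i]
        if line.startswith("^A"):
            op, rev = line[2], int(line[3:])
            # Skip insertions newer than x and deletions that apply to x.
            if (op == "I" and rev > x) or (op == "D" and rev <= x):
                i = find_end(weave, i, rev)
            # "^AE" lines (and blocks we do not skip) are simply ignored.
        else:
            out.append(line)
        i += 1
    return out


for rev in (1, 2, 3):
    print(rev, checkout(WEAVE, rev))
```

Running it prints the three columns of the table in section 0.1, one revision per line.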
+0.3 Annotate
+The "annotate" operation is to show extra metadata like the revision number
+and the original line number a line comes from.
+It's basically just a "Checkout". The extra metadata can be stored side by
+side with the line contents. Alternatively, we can infer the revision
+number from the "^AI"s.
+Some SCM tools have to calculate diffs on the fly and thus are much slower
+on this operation.
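As a rough sketch of inferring the revision from the "^AI"s, the walk below tracks which blocks are currently open instead of jumping. A content line is treated as visible at revision X when every enclosing insertion has rev <= X and no enclosing deletion has rev <= X, which should agree with the jump-based checkout for well-formed weaves. The revision reported for a line is that of the innermost enclosing insertion. The function name and the literal "^A" spelling are assumptions made for this example.

```python
WEAVE = [
    "^AI 1", "a", "^AD 3", "b", "^AI 2", "1",
    "^AE 3", "2", "^AE 2", "c", "^AE 1",
]


def annotate(weave, x):
    """Return [(rev, text)] for revision x; rev is where the line was added."""
    open_blocks = []  # (op, rev) for every block we are currently inside
    result = []
    for line in weave:
        if line.startswith("^A"):
            op, rev = line[2], int(line[3:])
            if op == "E":
                # Close the most recently opened block with this revision.
                for i in range(len(open_blocks) - 1, -1, -1):
                    if open_blocks[i][1] == rev:
                        del open_blocks[i]
                        break
            else:
                open_blocks.append((op, rev))
            continue
        inserted = [r for op, r in open_blocks if op == "I"]
        deleted = [r for op, r in open_blocks if op == "D"]
        visible = (all(r <= x for r in inserted)
                   and not any(r <= x for r in deleted))
        if visible and inserted:
            # The innermost "^AI" is the revision that introduced the line.
            result.append((inserted[-1], line))
    return result


for rev, text in annotate(WEAVE, 3):
    print(rev, text)
```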
+0.4 Tree Structure
+The word "interleaved" is used because "^AI" .. "^AE" and "^AD" .. "^AE"
+blocks can be interleaved.
+If we consider insertions and deletions separately, they can form tree
+structures, respectively.
++--- ^AI 1        +--- ^AD 3
+| +- ^AI 2        | +- ^AD 2
+| |               | |
+| +- ^AE 2        | +- ^AE 2
+|                 |
++--- ^AE 1        +--- ^AE 3
+More specifically, it's possible to build a tree for all insertions, where
+the tree node has the structure "(rev, startline, endline)". "startline" is
+the line number of "^AI" and "endline" is the line number of the matched
+"^AE".  The tree will have these properties:
+1. child.rev > parent.rev
+2. child.startline > parent.startline
+3. child.endline < parent.endline
+A similar tree for all deletions can also be built with the first property
+changed to:
+1. child.rev < parent.rev
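To make the "(rev, startline, endline)" nodes concrete, the sketch below collects the insertion blocks of the example weave from section 0.1 and checks the three properties. Line numbers are 0-based indexes into the weave, the parent of a block is taken to be the innermost insertion still open when it starts, and the function name `insertion_nodes` is invented for this example.

```python
WEAVE = [
    "^AI 1", "a", "^AD 3", "b", "^AI 2", "1",
    "^AE 3", "2", "^AE 2", "c", "^AE 1",
]


def insertion_nodes(weave):
    """Return [[rev, startline, endline, parent_index_or_None], ...]."""
    nodes = []
    open_blocks = []  # (op, rev, node index or None for deletions)
    for lineno, line in enumerate(weave):
        if not line.startswith("^A"):
            continue
        op, rev = line[2], int(line[3:])
        if op == "I":
            parents = [n for o, _, n in open_blocks if o == "I"]
            parent = parents[-1] if parents else None
            nodes.append([rev, lineno, None, parent])
            open_blocks.append((op, rev, len(nodes) - 1))
        elif op == "D":
            open_blocks.append((op, rev, None))
        else:
            # "^AE rev" closes the most recently opened block with that rev.
            for i in range(len(open_blocks) - 1, -1, -1):
                _, r, n = open_blocks[i]
                if r == rev:
                    if n is not None:
                        nodes[n][2] = lineno
                    del open_blocks[i]
                    break
    return nodes


nodes = insertion_nodes(WEAVE)
for rev, start, end, parent in nodes:
    if parent is not None:
        prev, pstart, pend, _ = nodes[parent]
        # The three tree properties listed above.
        assert rev > prev and start > pstart and end < pend
print(nodes)
```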
+0.5 Malformed Cases
+The following cases are considered malformed in our implementation:
+1. Interleaved insertions, or interleaved deletions.
+They can be rewritten to a non-interleaved tree structure.
+^AI/D x     ^AI/D x
+^AI/D y  -> ^AI/D y
+^AE x       ^AE y
+^AE y       ^AE x
+2. Nested insertions, where the inner one has a smaller revision number.
+It can be rewritten to a non-nested form.
+^AI x + 1     ^AI x + 1
+^AI x      -> ^AE x + 1
+^AE x         ^AI x
+^AE x + 1     ^AE x
+3. Insertion or deletion inside another deletion, where the outer deletion
+block has a smaller revision number.
+^AD x          ^AD x
+^AI/D x + 1 -> ^AE x
+^AE x + 1      ^AI/D x + 1
+^AE x          ^AE x
+Some of them may be valid in other implementations for special purposes. For
+example, to "revive" a previously deleted block in a newer revision.
+0.6 Cases That Can Be Optimized
+It's always better to get things nested. For example, the left is more
+efficient than the right while they represent the same content:
++--- ^AD 2          +- ^AD 1
+| +- ^AD 1          |   LINE A
+| |   LINE A        +- ^AE 1
+| +- ^AE 1          +- ^AD 2
+|     LINE B        |   LINE B
++--- ^AE 2          +- ^AE 2
+Our implementation sometimes generates the less efficient form. Always
+producing the optimal form would require extra code complexity that does
+not seem worthwhile.
+0.7 Inefficiency
+The file format can be slow because:
+- Inserting a new line at position P requires rewriting all data after P.
+- Finding "^AE" requires walking through the content (O(N), where N is the
+number of lines between "^AI/D" and "^AE").
+1. Linelog
+Linelog is a binary format dedicated to speeding up the "annotate"
+operation of Mercurial (or git). It's designed to avoid the issues mentioned
+in section 0.7.
+1.1 Content Stored
+Linelog is not another storage for file contents. It only stores line
+numbers and corresponding revision numbers, instead of actual line content.
+This is okay for the "annotate" operation because the external source can
+usually check out the content of a file at a specific revision quickly.
+A typical SCCS weave is also fast on the "grep" operation, which needs
+random accesses to line contents from different revisions of a file. This
+can be slow with linelog's no-line-content design. However, we could use
+an extra map ((rev, line num) -> line content) to speed it up.
+Note the revision numbers in linelog should be independent from mercurial
+integer revision numbers. There should be some mapping between linelog rev
+and hg hash stored side by side, to make the files reusable after being
+copied to another machine.
+1.2 Basic Format
+A linelog file consists of "instruction"s. An "instruction" can be either:
+- JGE  REV ADDR     # jump to ADDR if rev >= REV
+- JL   REV ADDR     # jump to ADDR if rev < REV
+- LINE REV LINENUM  # append the (LINENUM+1)-th line in revision REV
+For example, here is a linelog representing the same file with the
+3 revisions mentioned in section 0.1:
+SCCS  |    Linelog
+Weave | Addr : Instruction
+------+------+-------------
+^AI 1 |    0 : JL   1 8
+a     |    1 : LINE 1 0
+^AD 3 |    2 : JGE  3 6
+b     |    3 : LINE 1 1
+^AI 2 |    4 : JL   2 7
+1     |    5 : LINE 2 2
+^AE 3 |
+2     |    6 : LINE 2 3
+^AE 2 |
+c     |    7 : LINE 1 2
+^AE 1 |
+      |    8 : END
+This way, "find ^AE" is O(1) because we just jump there. And we can insert
+new lines without rewriting most of the file, by appending the new lines and
+changing a single instruction to jump to them.
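The instruction table above can be executed by a very small interpreter. The following is a sketch under stated assumptions rather than the actual Mercurial code: instructions are modelled as plain tuples, "END" is treated as a fourth pseudo-instruction exactly as it appears in the table, and the names `PROGRAM` and `annotate` are chosen for this example.

```python
# (opcode, REV, ADDR or LINENUM), indexed by address as in the table above.
PROGRAM = [
    ("JL", 1, 8),    # 0
    ("LINE", 1, 0),  # 1
    ("JGE", 3, 6),   # 2
    ("LINE", 1, 1),  # 3
    ("JL", 2, 7),    # 4
    ("LINE", 2, 2),  # 5
    ("LINE", 2, 3),  # 6
    ("LINE", 1, 2),  # 7
    ("END", 0, 0),   # 8
]


def annotate(program, rev):
    """Return [(introducing rev, linenum in that rev)] for revision rev."""
    pc = 0
    result = []
    while True:
        op, a, b = program[pc]
        if op == "END":
            return result
        if op == "JGE" and rev >= a:
            pc = b  # O(1) jump, no scanning for a matching "^AE"
        elif op == "JL" and rev < a:
            pc = b
        else:
            if op == "LINE":
                result.append((a, b))
            pc += 1


print(annotate(PROGRAM, 3))
```

For revision 3 this yields (1, 0), (2, 3), (1, 2): line 1 of revision 1 ("a"), line 4 of revision 2 ("2") and line 3 of revision 1 ("c"), matching the REV 3 column in section 0.1. The line contents themselves have to be fetched from elsewhere, as section 1.1 explains.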
+The current implementation uses 64 bits for an instruction: The opcode (JGE,
+JL or LINE) takes 2 bits, REV takes 30 bits and ADDR or LINENUM takes 32
+bits. It also stores the max revision number and buffer size at the first
+64 bits for quick access to these values.
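One way to pack such an instruction into 64 bits is sketched below with Python's struct module. The text above only gives the field widths, so everything else here is an assumption for illustration: the big-endian byte order, placing the opcode in the low 2 bits of the first 32-bit word with REV in the upper 30 bits, and the numeric opcode values.

```python
import struct

ENTRY = struct.Struct(">II")  # two big-endian 32-bit words = 64 bits

# Assumed opcode numbering; the source only says the opcode takes 2 bits.
OPCODES = {"JGE": 0, "JL": 1, "LINE": 2}
NAMES = {value: name for name, value in OPCODES.items()}


def encode(op, rev, operand):
    """Pack (opcode, REV, ADDR or LINENUM) into 8 bytes."""
    if not 0 <= rev < 1 << 30:
        raise ValueError("REV must fit in 30 bits")
    if not 0 <= operand < 1 << 32:
        raise ValueError("ADDR/LINENUM must fit in 32 bits")
    return ENTRY.pack((rev << 2) | OPCODES[op], operand)


def decode(data):
    """Inverse of encode()."""
    word, operand = ENTRY.unpack(data)
    return NAMES[word & 0b11], word >> 2, operand


raw = encode("JL", 1, 8)
print(len(raw), decode(raw))
```

A fixed 8-byte entry means the instruction at address N lives at a predictable byte offset, which is what makes jumping to ADDR cheap.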
+1.3 Comparing with Mercurial's revlog format
+Apparently, linelog is very different from revlog: linelog stores rev and
+line numbers, while revlog has line contents and other metadata (like
+parents, flags). However, the revlog format could also be used to store rev
+and line numbers. For example, to speed up the annotate operation, we could
+also pre-calculate annotate results and just store them using the revlog
+format.
+Therefore, linelog is actually somewhat similar to revlog, with the important
+trade-off that it only supports linear history (mentioned in section 0.1).
+Essentially, the differences are:
+a) Linelog is full of deltas, while revlog could contain full file
+contents sometimes. So linelog is smaller. Revlog could trade
+reconstruction speed for file size - best case, revlog is as small as
+linelog.
+b) The interleaved delta structure allows skipping large portions of
+uninteresting deltas, so linelog's content reconstruction is faster than
+the delta-only version of revlog (however it's possible to construct
+a case where interleaved deltas degrade to plain deltas, so linelog
+worst case would be delta-only revlog). Revlog could trade file size
+for reconstruction speed.
+c) Linelog implicitly maintains the order of all lines it stores. So it
+could dump all the lines from all revisions, with a reasonable order.
+While revlog could also dump all line additions, it requires extra
+computation to figure out the order in which to put those lines - that's
+some kind of "merge".
+"c" makes "hg absorb" easier to implement and makes it possible to do
+"annotate --deleted".
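Point "c" can be illustrated with the example program from section 1.2. The sketch below dumps every LINE instruction in address order and then reports, for a given revision, the lines that existed earlier but are no longer present. It assumes the instructions sit in logical order, as they do in that small table; a linelog that has had blocks appended would need its jumps followed instead. All function names are invented for this example.

```python
PROGRAM = [
    ("JL", 1, 8), ("LINE", 1, 0), ("JGE", 3, 6), ("LINE", 1, 1),
    ("JL", 2, 7), ("LINE", 2, 2), ("LINE", 2, 3), ("LINE", 1, 2),
    ("END", 0, 0),
]


def annotate(program, rev):
    """Run the program for one revision, returning [(rev, linenum)]."""
    pc, result = 0, []
    while program[pc][0] != "END":
        op, a, b = program[pc]
        if (op == "JGE" and rev >= a) or (op == "JL" and rev < a):
            pc = b
            continue
        if op == "LINE":
            result.append((a, b))
        pc += 1
    return result


def all_lines(program):
    """Every line from every revision, in the order linelog keeps them."""
    return [(a, b) for op, a, b in program if op == "LINE"]


def deleted_lines(program, rev):
    """Lines introduced at or before rev that are gone in rev."""
    alive = set(annotate(program, rev))
    return [(a, b) for a, b in all_lines(program)
            if a <= rev and (a, b) not in alive]


print(all_lines(PROGRAM))
print(deleted_lines(PROGRAM, 3))
```

For revision 3 the deleted lines are (1, 1) and (2, 2), that is "b" from revision 1 and "1" from revision 2, reported in the position where they used to sit.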
