diff mercurial/help/internals/linelog.txt @ 38795:422d661056be

linelog: add a Python implementation of the linelog datastructure This datastructure was originally developed by Jun Wu at Facebook, inspired by SCCS weaves. It's useful as a cache for blame information, but also is the magic that makes `hg absorb` easy to implement. In service of importing the code to Mercurial, I wanted to actually /understand/ it, and once I did I decided to take a run at implementing it. The help/internals/linelog.txt document is the README from Jun Wu's implementaiton. It all applies to our linelog implementation. Differential Revision: https://phab.mercurial-scm.org/D3990
author Augie Fackler <augie@google.com>
date Mon, 30 Jul 2018 10:42:37 -0400
parents
children c10be3fc200b
line wrap: on
line diff
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/mercurial/help/internals/linelog.txt	Mon Jul 30 10:42:37 2018 -0400
@@ -0,0 +1,251 @@
+linelog is a storage format inspired by the "Interleaved deltas" idea. See
+https://en.wikipedia.org/wiki/Interleaved_deltas for its introduction.
+
+0. SCCS Weave
+
+  To understand what linelog is, first we have a quick look at a simplified
+  (with header removed) SCCS weave format, which is an implementation of the
+  "Interleaved deltas" idea.
+
+0.1 Basic SCCS Weave File Format
+
+  A SCCS weave file consists of plain text lines. Each line is either a
+  special instruction starting with "^A" or part of the content of the real
+  file the weave tracks. There are 3 important operations, where REV denotes
+  the revision number:
+
+    ^AI REV, marking the beginning of an insertion block introduced by REV
+    ^AD REV, marking the beginning of a deletion block introduced by REV
+    ^AE REV, marking the end of the block started by "^AI REV" or "^AD REV"
+
+  Note on revision numbers: For any two different revision numbers, one must
+  be an ancestor of the other to make them comparable. This enforces linear
+  history. Besides, the comparison functions (">=", "<") should be efficient.
+  This means, if revisions are strings like git or hg, an external map is
+  required to convert them into integers.
+
+  For example, to represent the following changes:
+
+    REV 1 | REV 2 | REV 3
+    ------+-------+-------
+    a     | a     | a
+    b     | b     | 2
+    c     | 1     | c
+          | 2     |
+          | c     |
+
+  A possible weave file looks like:
+
+    ^AI 1
+    a
+    ^AD 3
+    b
+    ^AI 2
+    1
+    ^AE 3
+    2
+    ^AE 2
+    c
+    ^AE 1
+
+  An "^AE" does not always match its nearest operation ("^AI" or "^AD"). In
+  the above example, "^AE 3" does not match the nearest "^AI 2" but "^AD 3".
+  Therefore we need some extra information for "^AE". The SCCS weave uses a
+  revision number. It could also be a boolean value about whether it is an
+  insertion or a deletion (see section 0.4).
+
+0.2 Checkout
+
+  The "checkout" operation is to retrieve file content at a given revision,
+  say X. It's doable by going through the file line by line and:
+
+    - If meet ^AI rev, and rev > X, find the corresponding ^AE and jump there
+    - If meet ^AD rev, and rev <= X, find the corresponding ^AE and jump there
+    - Ignore ^AE
+    - For normal lines, just output them
+
+0.3 Annotate
+
+  The "annotate" operation is to show extra metadata like the revision number
+  and the original line number a line comes from.
+
+  It's basically just a "Checkout". For the extra metadata, they can be stored
+  side by side with the line contents. Alternatively, we can infer the
+  revision number from "^AI"s.
+
+  Some SCM tools have to calculate diffs on the fly and thus are much slower
+  on this operation.
+
+0.4 Tree Structure
+
+  The word "interleaved" is used because "^AI" .. "^AE" and "^AD" .. "^AE"
+  blocks can be interleaved.
+
+  If we consider insertions and deletions separately, they can form tree
+  structures, respectively.
+
+    +--- ^AI 1        +--- ^AD 3
+    | +- ^AI 2        | +- ^AD 2
+    | |               | |
+    | +- ^AE 2        | +- ^AE 2
+    |                 |
+    +--- ^AE 1        +--- ^AE 3
+
+  More specifically, it's possible to build a tree for all insertions, where
+  the tree node has the structure "(rev, startline, endline)". "startline" is
+  the line number of "^AI" and "endline" is the line number of the matched
+  "^AE".  The tree will have these properties:
+
+    1. child.rev > parent.rev
+    2. child.startline > parent.startline
+    3. child.endline < parent.endline
+
+  A similar tree for all deletions can also be built with the first property
+  changed to:
+
+    1. child.rev < parent.rev
+
+0.5 Malformed Cases
+
+  The following cases are considered malformed in our implementation:
+
+    1. Interleaved insertions, or interleaved deletions.
+       It can be rewritten to a non-interleaved tree structure.
+
+       ^AI/D x     ^AI/D x
+       ^AI/D y  -> ^AI/D y
+       ^AE x       ^AE y
+       ^AE y       ^AE x
+
+    2. Nested insertions, where the inner one has a smaller revision number.
+       It can be rewritten to a non-nested form.
+
+       ^AI x + 1     ^AI x + 1
+       ^AI x      -> ^AE x + 1
+       ^AE x         ^AI x
+       ^AE x + 1     ^AE x
+
+    3. Insertion or deletion inside another deletion, where the outer deletion
+       block has a smaller revision number.
+
+       ^AD x          ^AD x
+       ^AI/D x + 1 -> ^AE x
+       ^AE x + 1      ^AI/D x + 1
+       ^AE x          ^AE x
+
+  Some of them may be valid in other implementations for special purposes. For
+  example, to "revive" a previously deleted block in a newer revision.
+
+0.6 Cases Can Be Optimized
+
+  It's always better to get things nested. For example, the left is more
+  efficient than the right while they represent the same content:
+
+    +--- ^AD 2          +- ^AD 1
+    | +- ^AD 1          |   LINE A
+    | |   LINE A        +- ^AE 1
+    | +- ^AE 1          +- ^AD 2
+    |     LINE B        |   LINE B
+    +--- ^AE 2          +- ^AE 2
+
+  Our implementation sometimes generates the less efficient data. To always
+  get the optimal form, it requires extra code complexity that seems unworthy.
+
+0.7 Inefficiency
+
+  The file format can be slow because:
+
+  - Inserting a new line at position P requires rewriting all data after P.
+  - Finding "^AE" requires walking through the content (O(N), where N is the
+    number of lines between "^AI/D" and "^AE").
+
+1. Linelog
+
+  The linelog is a binary format that dedicates to speed up mercurial (or
+  git)'s "annotate" operation. It's designed to avoid issues mentioned in
+  section 0.7.
+
+1.1 Content Stored
+
+  Linelog is not another storage for file contents. It only stores line
+  numbers and corresponding revision numbers, instead of actual line content.
+  This is okay for the "annotate" operation because usually the external
+  source is fast to checkout the content of a file at a specific revision.
+
+  A typical SCCS weave is also fast on the "grep" operation, which needs
+  random accesses to line contents from different revisions of a file. This
+  can be slow with linelog's no-line-content design. However we could use
+  an extra map ((rev, line num) -> line content) to speed it up.
+
+  Note the revision numbers in linelog should be independent from mercurial
+  integer revision numbers. There should be some mapping between linelog rev
+  and hg hash stored side by side, to make the files reusable after being
+  copied to another machine.
+
+1.2 Basic Format
+
+  A linelog file consists of "instruction"s. An "instruction" can be either:
+
+    - JGE  REV ADDR     # jump to ADDR if rev >= REV
+    - JL   REV ADDR     # jump to ADDR if rev < REV
+    - LINE REV LINENUM  # append the (LINENUM+1)-th line in revision REV
+
+  For example, here is the example linelog representing the same file with
+  3 revisions mentioned in section 0.1:
+
+    SCCS  |    Linelog
+    Weave | Addr : Instruction
+    ------+------+-------------
+    ^AI 1 |    0 : JL   1 8
+    a     |    1 : LINE 1 0
+    ^AD 3 |    2 : JGE  3 6
+    b     |    3 : LINE 1 1
+    ^AI 2 |    4 : JL   2 7
+    1     |    5 : LINE 2 2
+    ^AE 3 |
+    2     |    6 : LINE 2 3
+    ^AE 2 |
+    c     |    7 : LINE 1 2
+    ^AE 1 |
+          |    8 : END
+
+  This way, "find ^AE" is O(1) because we just jump there. And we can insert
+  new lines without rewriting most part of the file by appending new lines and
+  changing a single instruction to jump to them.
+
+  The current implementation uses 64 bits for an instruction: The opcode (JGE,
+  JL or LINE) takes 2 bits, REV takes 30 bits and ADDR or LINENUM takes 32
+  bits. It also stores the max revision number and buffer size at the first
+  64 bits for quick access to these values.
+
+1.3 Comparing with Mercurial's revlog format
+
+  Apparently, linelog is very different from revlog: linelog stores rev and
+  line numbers, while revlog has line contents and other metadata (like
+  parents, flags). However, the revlog format could also be used to store rev
+  and line numbers. For example, to speed up the annotate operation, we could
+  also pre-calculate annotate results and just store them using the revlog
+  format.
+
+  Therefore, linelog is actually somehow similar to revlog, with the important
+  trade-off that it only supports linear history (mentioned in section 0.1).
+  Essentially, the differences are:
+
+    a) Linelog is full of deltas, while revlog could contain full file
+       contents sometimes. So linelog is smaller. Revlog could trade
+       reconstruction speed for file size - best case, revlog is as small as
+       linelog.
+    b) The interleaved delta structure allows skipping large portion of
+       uninteresting deltas so linelog's content reconstruction is faster than
+       the delta-only version of revlog (however it's possible to construct
+       a case where interleaved deltas degrade to plain deltas, so linelog
+       worst case would be delta-only revlog). Revlog could trade file size
+       for reconstruction speed.
+    c) Linelog implicitly maintains the order of all lines it stores. So it
+       could dump all the lines from all revisions, with a reasonable order.
+       While revlog could also dump all line additions, it requires extra
+       computation to figure out the order putting those lines - that's some
+       kind of "merge".
+
+  "c" makes "hg absorb" easier to implement and makes it possible to do
+  "annotate --deleted".