Mercurial > hg
diff mercurial/help/internals/linelog.txt @ 38795:422d661056be
linelog: add a Python implementation of the linelog datastructure
This datastructure was originally developed by Jun Wu at Facebook,
inspired by SCCS weaves. It's useful as a cache for blame information,
but also is the magic that makes `hg absorb` easy to implement. In
service of importing the code to Mercurial, I wanted to actually
/understand/ it, and once I did I decided to take a run at
implementing it.
The help/internals/linelog.txt document is the README from Jun Wu's
implementaiton. It all applies to our linelog implementation.
Differential Revision: https://phab.mercurial-scm.org/D3990
author | Augie Fackler <augie@google.com> |
---|---|
date | Mon, 30 Jul 2018 10:42:37 -0400 |
parents | |
children | c10be3fc200b |
line wrap: on
line diff
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/mercurial/help/internals/linelog.txt Mon Jul 30 10:42:37 2018 -0400 @@ -0,0 +1,251 @@ +linelog is a storage format inspired by the "Interleaved deltas" idea. See +https://en.wikipedia.org/wiki/Interleaved_deltas for its introduction. + +0. SCCS Weave + + To understand what linelog is, first we have a quick look at a simplified + (with header removed) SCCS weave format, which is an implementation of the + "Interleaved deltas" idea. + +0.1 Basic SCCS Weave File Format + + A SCCS weave file consists of plain text lines. Each line is either a + special instruction starting with "^A" or part of the content of the real + file the weave tracks. There are 3 important operations, where REV denotes + the revision number: + + ^AI REV, marking the beginning of an insertion block introduced by REV + ^AD REV, marking the beginning of a deletion block introduced by REV + ^AE REV, marking the end of the block started by "^AI REV" or "^AD REV" + + Note on revision numbers: For any two different revision numbers, one must + be an ancestor of the other to make them comparable. This enforces linear + history. Besides, the comparison functions (">=", "<") should be efficient. + This means, if revisions are strings like git or hg, an external map is + required to convert them into integers. + + For example, to represent the following changes: + + REV 1 | REV 2 | REV 3 + ------+-------+------- + a | a | a + b | b | 2 + c | 1 | c + | 2 | + | c | + + A possible weave file looks like: + + ^AI 1 + a + ^AD 3 + b + ^AI 2 + 1 + ^AE 3 + 2 + ^AE 2 + c + ^AE 1 + + An "^AE" does not always match its nearest operation ("^AI" or "^AD"). In + the above example, "^AE 3" does not match the nearest "^AI 2" but "^AD 3". + Therefore we need some extra information for "^AE". The SCCS weave uses a + revision number. It could also be a boolean value about whether it is an + insertion or a deletion (see section 0.4). + +0.2 Checkout + + The "checkout" operation is to retrieve file content at a given revision, + say X. It's doable by going through the file line by line and: + + - If meet ^AI rev, and rev > X, find the corresponding ^AE and jump there + - If meet ^AD rev, and rev <= X, find the corresponding ^AE and jump there + - Ignore ^AE + - For normal lines, just output them + +0.3 Annotate + + The "annotate" operation is to show extra metadata like the revision number + and the original line number a line comes from. + + It's basically just a "Checkout". For the extra metadata, they can be stored + side by side with the line contents. Alternatively, we can infer the + revision number from "^AI"s. + + Some SCM tools have to calculate diffs on the fly and thus are much slower + on this operation. + +0.4 Tree Structure + + The word "interleaved" is used because "^AI" .. "^AE" and "^AD" .. "^AE" + blocks can be interleaved. + + If we consider insertions and deletions separately, they can form tree + structures, respectively. + + +--- ^AI 1 +--- ^AD 3 + | +- ^AI 2 | +- ^AD 2 + | | | | + | +- ^AE 2 | +- ^AE 2 + | | + +--- ^AE 1 +--- ^AE 3 + + More specifically, it's possible to build a tree for all insertions, where + the tree node has the structure "(rev, startline, endline)". "startline" is + the line number of "^AI" and "endline" is the line number of the matched + "^AE". The tree will have these properties: + + 1. child.rev > parent.rev + 2. child.startline > parent.startline + 3. child.endline < parent.endline + + A similar tree for all deletions can also be built with the first property + changed to: + + 1. child.rev < parent.rev + +0.5 Malformed Cases + + The following cases are considered malformed in our implementation: + + 1. Interleaved insertions, or interleaved deletions. + It can be rewritten to a non-interleaved tree structure. + + ^AI/D x ^AI/D x + ^AI/D y -> ^AI/D y + ^AE x ^AE y + ^AE y ^AE x + + 2. Nested insertions, where the inner one has a smaller revision number. + It can be rewritten to a non-nested form. + + ^AI x + 1 ^AI x + 1 + ^AI x -> ^AE x + 1 + ^AE x ^AI x + ^AE x + 1 ^AE x + + 3. Insertion or deletion inside another deletion, where the outer deletion + block has a smaller revision number. + + ^AD x ^AD x + ^AI/D x + 1 -> ^AE x + ^AE x + 1 ^AI/D x + 1 + ^AE x ^AE x + + Some of them may be valid in other implementations for special purposes. For + example, to "revive" a previously deleted block in a newer revision. + +0.6 Cases Can Be Optimized + + It's always better to get things nested. For example, the left is more + efficient than the right while they represent the same content: + + +--- ^AD 2 +- ^AD 1 + | +- ^AD 1 | LINE A + | | LINE A +- ^AE 1 + | +- ^AE 1 +- ^AD 2 + | LINE B | LINE B + +--- ^AE 2 +- ^AE 2 + + Our implementation sometimes generates the less efficient data. To always + get the optimal form, it requires extra code complexity that seems unworthy. + +0.7 Inefficiency + + The file format can be slow because: + + - Inserting a new line at position P requires rewriting all data after P. + - Finding "^AE" requires walking through the content (O(N), where N is the + number of lines between "^AI/D" and "^AE"). + +1. Linelog + + The linelog is a binary format that dedicates to speed up mercurial (or + git)'s "annotate" operation. It's designed to avoid issues mentioned in + section 0.7. + +1.1 Content Stored + + Linelog is not another storage for file contents. It only stores line + numbers and corresponding revision numbers, instead of actual line content. + This is okay for the "annotate" operation because usually the external + source is fast to checkout the content of a file at a specific revision. + + A typical SCCS weave is also fast on the "grep" operation, which needs + random accesses to line contents from different revisions of a file. This + can be slow with linelog's no-line-content design. However we could use + an extra map ((rev, line num) -> line content) to speed it up. + + Note the revision numbers in linelog should be independent from mercurial + integer revision numbers. There should be some mapping between linelog rev + and hg hash stored side by side, to make the files reusable after being + copied to another machine. + +1.2 Basic Format + + A linelog file consists of "instruction"s. An "instruction" can be either: + + - JGE REV ADDR # jump to ADDR if rev >= REV + - JL REV ADDR # jump to ADDR if rev < REV + - LINE REV LINENUM # append the (LINENUM+1)-th line in revision REV + + For example, here is the example linelog representing the same file with + 3 revisions mentioned in section 0.1: + + SCCS | Linelog + Weave | Addr : Instruction + ------+------+------------- + ^AI 1 | 0 : JL 1 8 + a | 1 : LINE 1 0 + ^AD 3 | 2 : JGE 3 6 + b | 3 : LINE 1 1 + ^AI 2 | 4 : JL 2 7 + 1 | 5 : LINE 2 2 + ^AE 3 | + 2 | 6 : LINE 2 3 + ^AE 2 | + c | 7 : LINE 1 2 + ^AE 1 | + | 8 : END + + This way, "find ^AE" is O(1) because we just jump there. And we can insert + new lines without rewriting most part of the file by appending new lines and + changing a single instruction to jump to them. + + The current implementation uses 64 bits for an instruction: The opcode (JGE, + JL or LINE) takes 2 bits, REV takes 30 bits and ADDR or LINENUM takes 32 + bits. It also stores the max revision number and buffer size at the first + 64 bits for quick access to these values. + +1.3 Comparing with Mercurial's revlog format + + Apparently, linelog is very different from revlog: linelog stores rev and + line numbers, while revlog has line contents and other metadata (like + parents, flags). However, the revlog format could also be used to store rev + and line numbers. For example, to speed up the annotate operation, we could + also pre-calculate annotate results and just store them using the revlog + format. + + Therefore, linelog is actually somehow similar to revlog, with the important + trade-off that it only supports linear history (mentioned in section 0.1). + Essentially, the differences are: + + a) Linelog is full of deltas, while revlog could contain full file + contents sometimes. So linelog is smaller. Revlog could trade + reconstruction speed for file size - best case, revlog is as small as + linelog. + b) The interleaved delta structure allows skipping large portion of + uninteresting deltas so linelog's content reconstruction is faster than + the delta-only version of revlog (however it's possible to construct + a case where interleaved deltas degrade to plain deltas, so linelog + worst case would be delta-only revlog). Revlog could trade file size + for reconstruction speed. + c) Linelog implicitly maintains the order of all lines it stores. So it + could dump all the lines from all revisions, with a reasonable order. + While revlog could also dump all line additions, it requires extra + computation to figure out the order putting those lines - that's some + kind of "merge". + + "c" makes "hg absorb" easier to implement and makes it possible to do + "annotate --deleted".