comparison mercurial/help/internals/linelog.txt @ 38795:422d661056be

linelog: add a Python implementation of the linelog datastructure This datastructure was originally developed by Jun Wu at Facebook, inspired by SCCS weaves. It's useful as a cache for blame information, but also is the magic that makes `hg absorb` easy to implement. In service of importing the code to Mercurial, I wanted to actually /understand/ it, and once I did I decided to take a run at implementing it. The help/internals/linelog.txt document is the README from Jun Wu's implementaiton. It all applies to our linelog implementation. Differential Revision: https://phab.mercurial-scm.org/D3990
author Augie Fackler <augie@google.com>
date Mon, 30 Jul 2018 10:42:37 -0400
parents
children c10be3fc200b
comparison
equal deleted inserted replaced
38794:1d01cf0416a5 38795:422d661056be
1 linelog is a storage format inspired by the "Interleaved deltas" idea. See
2 https://en.wikipedia.org/wiki/Interleaved_deltas for its introduction.
3
4 0. SCCS Weave
5
6 To understand what linelog is, first we have a quick look at a simplified
7 (with header removed) SCCS weave format, which is an implementation of the
8 "Interleaved deltas" idea.
9
10 0.1 Basic SCCS Weave File Format
11
12 A SCCS weave file consists of plain text lines. Each line is either a
13 special instruction starting with "^A" or part of the content of the real
14 file the weave tracks. There are 3 important operations, where REV denotes
15 the revision number:
16
17 ^AI REV, marking the beginning of an insertion block introduced by REV
18 ^AD REV, marking the beginning of a deletion block introduced by REV
19 ^AE REV, marking the end of the block started by "^AI REV" or "^AD REV"
20
21 Note on revision numbers: For any two different revision numbers, one must
22 be an ancestor of the other to make them comparable. This enforces linear
23 history. Besides, the comparison functions (">=", "<") should be efficient.
24 This means, if revisions are strings like git or hg, an external map is
25 required to convert them into integers.
26
27 For example, to represent the following changes:
28
29 REV 1 | REV 2 | REV 3
30 ------+-------+-------
31 a | a | a
32 b | b | 2
33 c | 1 | c
34 | 2 |
35 | c |
36
37 A possible weave file looks like:
38
39 ^AI 1
40 a
41 ^AD 3
42 b
43 ^AI 2
44 1
45 ^AE 3
46 2
47 ^AE 2
48 c
49 ^AE 1
50
51 An "^AE" does not always match its nearest operation ("^AI" or "^AD"). In
52 the above example, "^AE 3" does not match the nearest "^AI 2" but "^AD 3".
53 Therefore we need some extra information for "^AE". The SCCS weave uses a
54 revision number. It could also be a boolean value about whether it is an
55 insertion or a deletion (see section 0.4).
56
57 0.2 Checkout
58
59 The "checkout" operation is to retrieve file content at a given revision,
60 say X. It's doable by going through the file line by line and:
61
62 - If meet ^AI rev, and rev > X, find the corresponding ^AE and jump there
63 - If meet ^AD rev, and rev <= X, find the corresponding ^AE and jump there
64 - Ignore ^AE
65 - For normal lines, just output them
66
67 0.3 Annotate
68
69 The "annotate" operation is to show extra metadata like the revision number
70 and the original line number a line comes from.
71
72 It's basically just a "Checkout". For the extra metadata, they can be stored
73 side by side with the line contents. Alternatively, we can infer the
74 revision number from "^AI"s.
75
76 Some SCM tools have to calculate diffs on the fly and thus are much slower
77 on this operation.
78
79 0.4 Tree Structure
80
81 The word "interleaved" is used because "^AI" .. "^AE" and "^AD" .. "^AE"
82 blocks can be interleaved.
83
84 If we consider insertions and deletions separately, they can form tree
85 structures, respectively.
86
87 +--- ^AI 1 +--- ^AD 3
88 | +- ^AI 2 | +- ^AD 2
89 | | | |
90 | +- ^AE 2 | +- ^AE 2
91 | |
92 +--- ^AE 1 +--- ^AE 3
93
94 More specifically, it's possible to build a tree for all insertions, where
95 the tree node has the structure "(rev, startline, endline)". "startline" is
96 the line number of "^AI" and "endline" is the line number of the matched
97 "^AE". The tree will have these properties:
98
99 1. child.rev > parent.rev
100 2. child.startline > parent.startline
101 3. child.endline < parent.endline
102
103 A similar tree for all deletions can also be built with the first property
104 changed to:
105
106 1. child.rev < parent.rev
107
108 0.5 Malformed Cases
109
110 The following cases are considered malformed in our implementation:
111
112 1. Interleaved insertions, or interleaved deletions.
113 It can be rewritten to a non-interleaved tree structure.
114
115 ^AI/D x ^AI/D x
116 ^AI/D y -> ^AI/D y
117 ^AE x ^AE y
118 ^AE y ^AE x
119
120 2. Nested insertions, where the inner one has a smaller revision number.
121 It can be rewritten to a non-nested form.
122
123 ^AI x + 1 ^AI x + 1
124 ^AI x -> ^AE x + 1
125 ^AE x ^AI x
126 ^AE x + 1 ^AE x
127
128 3. Insertion or deletion inside another deletion, where the outer deletion
129 block has a smaller revision number.
130
131 ^AD x ^AD x
132 ^AI/D x + 1 -> ^AE x
133 ^AE x + 1 ^AI/D x + 1
134 ^AE x ^AE x
135
136 Some of them may be valid in other implementations for special purposes. For
137 example, to "revive" a previously deleted block in a newer revision.
138
139 0.6 Cases Can Be Optimized
140
141 It's always better to get things nested. For example, the left is more
142 efficient than the right while they represent the same content:
143
144 +--- ^AD 2 +- ^AD 1
145 | +- ^AD 1 | LINE A
146 | | LINE A +- ^AE 1
147 | +- ^AE 1 +- ^AD 2
148 | LINE B | LINE B
149 +--- ^AE 2 +- ^AE 2
150
151 Our implementation sometimes generates the less efficient data. To always
152 get the optimal form, it requires extra code complexity that seems unworthy.
153
154 0.7 Inefficiency
155
156 The file format can be slow because:
157
158 - Inserting a new line at position P requires rewriting all data after P.
159 - Finding "^AE" requires walking through the content (O(N), where N is the
160 number of lines between "^AI/D" and "^AE").
161
162 1. Linelog
163
164 The linelog is a binary format that dedicates to speed up mercurial (or
165 git)'s "annotate" operation. It's designed to avoid issues mentioned in
166 section 0.7.
167
168 1.1 Content Stored
169
170 Linelog is not another storage for file contents. It only stores line
171 numbers and corresponding revision numbers, instead of actual line content.
172 This is okay for the "annotate" operation because usually the external
173 source is fast to checkout the content of a file at a specific revision.
174
175 A typical SCCS weave is also fast on the "grep" operation, which needs
176 random accesses to line contents from different revisions of a file. This
177 can be slow with linelog's no-line-content design. However we could use
178 an extra map ((rev, line num) -> line content) to speed it up.
179
180 Note the revision numbers in linelog should be independent from mercurial
181 integer revision numbers. There should be some mapping between linelog rev
182 and hg hash stored side by side, to make the files reusable after being
183 copied to another machine.
184
185 1.2 Basic Format
186
187 A linelog file consists of "instruction"s. An "instruction" can be either:
188
189 - JGE REV ADDR # jump to ADDR if rev >= REV
190 - JL REV ADDR # jump to ADDR if rev < REV
191 - LINE REV LINENUM # append the (LINENUM+1)-th line in revision REV
192
193 For example, here is the example linelog representing the same file with
194 3 revisions mentioned in section 0.1:
195
196 SCCS | Linelog
197 Weave | Addr : Instruction
198 ------+------+-------------
199 ^AI 1 | 0 : JL 1 8
200 a | 1 : LINE 1 0
201 ^AD 3 | 2 : JGE 3 6
202 b | 3 : LINE 1 1
203 ^AI 2 | 4 : JL 2 7
204 1 | 5 : LINE 2 2
205 ^AE 3 |
206 2 | 6 : LINE 2 3
207 ^AE 2 |
208 c | 7 : LINE 1 2
209 ^AE 1 |
210 | 8 : END
211
212 This way, "find ^AE" is O(1) because we just jump there. And we can insert
213 new lines without rewriting most part of the file by appending new lines and
214 changing a single instruction to jump to them.
215
216 The current implementation uses 64 bits for an instruction: The opcode (JGE,
217 JL or LINE) takes 2 bits, REV takes 30 bits and ADDR or LINENUM takes 32
218 bits. It also stores the max revision number and buffer size at the first
219 64 bits for quick access to these values.
220
221 1.3 Comparing with Mercurial's revlog format
222
223 Apparently, linelog is very different from revlog: linelog stores rev and
224 line numbers, while revlog has line contents and other metadata (like
225 parents, flags). However, the revlog format could also be used to store rev
226 and line numbers. For example, to speed up the annotate operation, we could
227 also pre-calculate annotate results and just store them using the revlog
228 format.
229
230 Therefore, linelog is actually somehow similar to revlog, with the important
231 trade-off that it only supports linear history (mentioned in section 0.1).
232 Essentially, the differences are:
233
234 a) Linelog is full of deltas, while revlog could contain full file
235 contents sometimes. So linelog is smaller. Revlog could trade
236 reconstruction speed for file size - best case, revlog is as small as
237 linelog.
238 b) The interleaved delta structure allows skipping large portion of
239 uninteresting deltas so linelog's content reconstruction is faster than
240 the delta-only version of revlog (however it's possible to construct
241 a case where interleaved deltas degrade to plain deltas, so linelog
242 worst case would be delta-only revlog). Revlog could trade file size
243 for reconstruction speed.
244 c) Linelog implicitly maintains the order of all lines it stores. So it
245 could dump all the lines from all revisions, with a reasonable order.
246 While revlog could also dump all line additions, it requires extra
247 computation to figure out the order putting those lines - that's some
248 kind of "merge".
249
250 "c" makes "hg absorb" easier to implement and makes it possible to do
251 "annotate --deleted".