Mercurial > hg
comparison mercurial/help/internals/linelog.txt @ 38795:422d661056be
linelog: add a Python implementation of the linelog datastructure
This datastructure was originally developed by Jun Wu at Facebook,
inspired by SCCS weaves. It's useful as a cache for blame information,
but also is the magic that makes `hg absorb` easy to implement. In
service of importing the code to Mercurial, I wanted to actually
/understand/ it, and once I did I decided to take a run at
implementing it.
The help/internals/linelog.txt document is the README from Jun Wu's
implementaiton. It all applies to our linelog implementation.
Differential Revision: https://phab.mercurial-scm.org/D3990
author | Augie Fackler <augie@google.com> |
---|---|
date | Mon, 30 Jul 2018 10:42:37 -0400 |
parents | |
children | c10be3fc200b |
comparison
equal
deleted
inserted
replaced
38794:1d01cf0416a5 | 38795:422d661056be |
---|---|
1 linelog is a storage format inspired by the "Interleaved deltas" idea. See | |
2 https://en.wikipedia.org/wiki/Interleaved_deltas for its introduction. | |
3 | |
4 0. SCCS Weave | |
5 | |
6 To understand what linelog is, first we have a quick look at a simplified | |
7 (with header removed) SCCS weave format, which is an implementation of the | |
8 "Interleaved deltas" idea. | |
9 | |
10 0.1 Basic SCCS Weave File Format | |
11 | |
12 A SCCS weave file consists of plain text lines. Each line is either a | |
13 special instruction starting with "^A" or part of the content of the real | |
14 file the weave tracks. There are 3 important operations, where REV denotes | |
15 the revision number: | |
16 | |
17 ^AI REV, marking the beginning of an insertion block introduced by REV | |
18 ^AD REV, marking the beginning of a deletion block introduced by REV | |
19 ^AE REV, marking the end of the block started by "^AI REV" or "^AD REV" | |
20 | |
21 Note on revision numbers: For any two different revision numbers, one must | |
22 be an ancestor of the other to make them comparable. This enforces linear | |
23 history. Besides, the comparison functions (">=", "<") should be efficient. | |
24 This means, if revisions are strings like git or hg, an external map is | |
25 required to convert them into integers. | |
26 | |
27 For example, to represent the following changes: | |
28 | |
29 REV 1 | REV 2 | REV 3 | |
30 ------+-------+------- | |
31 a | a | a | |
32 b | b | 2 | |
33 c | 1 | c | |
34 | 2 | | |
35 | c | | |
36 | |
37 A possible weave file looks like: | |
38 | |
39 ^AI 1 | |
40 a | |
41 ^AD 3 | |
42 b | |
43 ^AI 2 | |
44 1 | |
45 ^AE 3 | |
46 2 | |
47 ^AE 2 | |
48 c | |
49 ^AE 1 | |
50 | |
51 An "^AE" does not always match its nearest operation ("^AI" or "^AD"). In | |
52 the above example, "^AE 3" does not match the nearest "^AI 2" but "^AD 3". | |
53 Therefore we need some extra information for "^AE". The SCCS weave uses a | |
54 revision number. It could also be a boolean value about whether it is an | |
55 insertion or a deletion (see section 0.4). | |
56 | |
57 0.2 Checkout | |
58 | |
59 The "checkout" operation is to retrieve file content at a given revision, | |
60 say X. It's doable by going through the file line by line and: | |
61 | |
62 - If meet ^AI rev, and rev > X, find the corresponding ^AE and jump there | |
63 - If meet ^AD rev, and rev <= X, find the corresponding ^AE and jump there | |
64 - Ignore ^AE | |
65 - For normal lines, just output them | |
66 | |
67 0.3 Annotate | |
68 | |
69 The "annotate" operation is to show extra metadata like the revision number | |
70 and the original line number a line comes from. | |
71 | |
72 It's basically just a "Checkout". For the extra metadata, they can be stored | |
73 side by side with the line contents. Alternatively, we can infer the | |
74 revision number from "^AI"s. | |
75 | |
76 Some SCM tools have to calculate diffs on the fly and thus are much slower | |
77 on this operation. | |
78 | |
79 0.4 Tree Structure | |
80 | |
81 The word "interleaved" is used because "^AI" .. "^AE" and "^AD" .. "^AE" | |
82 blocks can be interleaved. | |
83 | |
84 If we consider insertions and deletions separately, they can form tree | |
85 structures, respectively. | |
86 | |
87 +--- ^AI 1 +--- ^AD 3 | |
88 | +- ^AI 2 | +- ^AD 2 | |
89 | | | | | |
90 | +- ^AE 2 | +- ^AE 2 | |
91 | | | |
92 +--- ^AE 1 +--- ^AE 3 | |
93 | |
94 More specifically, it's possible to build a tree for all insertions, where | |
95 the tree node has the structure "(rev, startline, endline)". "startline" is | |
96 the line number of "^AI" and "endline" is the line number of the matched | |
97 "^AE". The tree will have these properties: | |
98 | |
99 1. child.rev > parent.rev | |
100 2. child.startline > parent.startline | |
101 3. child.endline < parent.endline | |
102 | |
103 A similar tree for all deletions can also be built with the first property | |
104 changed to: | |
105 | |
106 1. child.rev < parent.rev | |
107 | |
108 0.5 Malformed Cases | |
109 | |
110 The following cases are considered malformed in our implementation: | |
111 | |
112 1. Interleaved insertions, or interleaved deletions. | |
113 It can be rewritten to a non-interleaved tree structure. | |
114 | |
115 ^AI/D x ^AI/D x | |
116 ^AI/D y -> ^AI/D y | |
117 ^AE x ^AE y | |
118 ^AE y ^AE x | |
119 | |
120 2. Nested insertions, where the inner one has a smaller revision number. | |
121 It can be rewritten to a non-nested form. | |
122 | |
123 ^AI x + 1 ^AI x + 1 | |
124 ^AI x -> ^AE x + 1 | |
125 ^AE x ^AI x | |
126 ^AE x + 1 ^AE x | |
127 | |
128 3. Insertion or deletion inside another deletion, where the outer deletion | |
129 block has a smaller revision number. | |
130 | |
131 ^AD x ^AD x | |
132 ^AI/D x + 1 -> ^AE x | |
133 ^AE x + 1 ^AI/D x + 1 | |
134 ^AE x ^AE x | |
135 | |
136 Some of them may be valid in other implementations for special purposes. For | |
137 example, to "revive" a previously deleted block in a newer revision. | |
138 | |
139 0.6 Cases Can Be Optimized | |
140 | |
141 It's always better to get things nested. For example, the left is more | |
142 efficient than the right while they represent the same content: | |
143 | |
144 +--- ^AD 2 +- ^AD 1 | |
145 | +- ^AD 1 | LINE A | |
146 | | LINE A +- ^AE 1 | |
147 | +- ^AE 1 +- ^AD 2 | |
148 | LINE B | LINE B | |
149 +--- ^AE 2 +- ^AE 2 | |
150 | |
151 Our implementation sometimes generates the less efficient data. To always | |
152 get the optimal form, it requires extra code complexity that seems unworthy. | |
153 | |
154 0.7 Inefficiency | |
155 | |
156 The file format can be slow because: | |
157 | |
158 - Inserting a new line at position P requires rewriting all data after P. | |
159 - Finding "^AE" requires walking through the content (O(N), where N is the | |
160 number of lines between "^AI/D" and "^AE"). | |
161 | |
162 1. Linelog | |
163 | |
164 The linelog is a binary format that dedicates to speed up mercurial (or | |
165 git)'s "annotate" operation. It's designed to avoid issues mentioned in | |
166 section 0.7. | |
167 | |
168 1.1 Content Stored | |
169 | |
170 Linelog is not another storage for file contents. It only stores line | |
171 numbers and corresponding revision numbers, instead of actual line content. | |
172 This is okay for the "annotate" operation because usually the external | |
173 source is fast to checkout the content of a file at a specific revision. | |
174 | |
175 A typical SCCS weave is also fast on the "grep" operation, which needs | |
176 random accesses to line contents from different revisions of a file. This | |
177 can be slow with linelog's no-line-content design. However we could use | |
178 an extra map ((rev, line num) -> line content) to speed it up. | |
179 | |
180 Note the revision numbers in linelog should be independent from mercurial | |
181 integer revision numbers. There should be some mapping between linelog rev | |
182 and hg hash stored side by side, to make the files reusable after being | |
183 copied to another machine. | |
184 | |
185 1.2 Basic Format | |
186 | |
187 A linelog file consists of "instruction"s. An "instruction" can be either: | |
188 | |
189 - JGE REV ADDR # jump to ADDR if rev >= REV | |
190 - JL REV ADDR # jump to ADDR if rev < REV | |
191 - LINE REV LINENUM # append the (LINENUM+1)-th line in revision REV | |
192 | |
193 For example, here is the example linelog representing the same file with | |
194 3 revisions mentioned in section 0.1: | |
195 | |
196 SCCS | Linelog | |
197 Weave | Addr : Instruction | |
198 ------+------+------------- | |
199 ^AI 1 | 0 : JL 1 8 | |
200 a | 1 : LINE 1 0 | |
201 ^AD 3 | 2 : JGE 3 6 | |
202 b | 3 : LINE 1 1 | |
203 ^AI 2 | 4 : JL 2 7 | |
204 1 | 5 : LINE 2 2 | |
205 ^AE 3 | | |
206 2 | 6 : LINE 2 3 | |
207 ^AE 2 | | |
208 c | 7 : LINE 1 2 | |
209 ^AE 1 | | |
210 | 8 : END | |
211 | |
212 This way, "find ^AE" is O(1) because we just jump there. And we can insert | |
213 new lines without rewriting most part of the file by appending new lines and | |
214 changing a single instruction to jump to them. | |
215 | |
216 The current implementation uses 64 bits for an instruction: The opcode (JGE, | |
217 JL or LINE) takes 2 bits, REV takes 30 bits and ADDR or LINENUM takes 32 | |
218 bits. It also stores the max revision number and buffer size at the first | |
219 64 bits for quick access to these values. | |
220 | |
221 1.3 Comparing with Mercurial's revlog format | |
222 | |
223 Apparently, linelog is very different from revlog: linelog stores rev and | |
224 line numbers, while revlog has line contents and other metadata (like | |
225 parents, flags). However, the revlog format could also be used to store rev | |
226 and line numbers. For example, to speed up the annotate operation, we could | |
227 also pre-calculate annotate results and just store them using the revlog | |
228 format. | |
229 | |
230 Therefore, linelog is actually somehow similar to revlog, with the important | |
231 trade-off that it only supports linear history (mentioned in section 0.1). | |
232 Essentially, the differences are: | |
233 | |
234 a) Linelog is full of deltas, while revlog could contain full file | |
235 contents sometimes. So linelog is smaller. Revlog could trade | |
236 reconstruction speed for file size - best case, revlog is as small as | |
237 linelog. | |
238 b) The interleaved delta structure allows skipping large portion of | |
239 uninteresting deltas so linelog's content reconstruction is faster than | |
240 the delta-only version of revlog (however it's possible to construct | |
241 a case where interleaved deltas degrade to plain deltas, so linelog | |
242 worst case would be delta-only revlog). Revlog could trade file size | |
243 for reconstruction speed. | |
244 c) Linelog implicitly maintains the order of all lines it stores. So it | |
245 could dump all the lines from all revisions, with a reasonable order. | |
246 While revlog could also dump all line additions, it requires extra | |
247 computation to figure out the order putting those lines - that's some | |
248 kind of "merge". | |
249 | |
250 "c" makes "hg absorb" easier to implement and makes it possible to do | |
251 "annotate --deleted". |