Mercurial > hg
comparison mercurial/helptext/internals/dirstate-v2.txt @ 48166:e8a576de703f
dirstate-v2: Add internal documentation
It can be viewed by running `hg help internals.dirstate-v2`
Since that command rewraps paragraphs, the source text is written with
semantic line breaks: https://sembr.org/
Differential Revision: https://phab.mercurial-scm.org/D11546
author | Simon Sapin <simon.sapin@octobus.net> |
---|---|
date | Fri, 01 Oct 2021 12:17:09 +0200 |
parents | |
children | eb8092f9304f |
comparison
equal
deleted
inserted
replaced
48165:d467e44f71d7 | 48166:e8a576de703f |
---|---|
1 The *dirstate* is what Mercurial uses internally to track | |
2 the state of files in the working directory, | |
3 such as set by commands like `hg add` and `hg rm`. | |
4 It also contains some cached data that help make `hg status` faster. | |
5 The name refers both to `.hg/dirstate` on the filesystem | |
6 and the corresponding data structure in memory while a Mercurial process | |
7 is running. | |
8 | |
9 The original file format, retroactively dubbed `dirstate-v1`, | |
10 is described at https://www.mercurial-scm.org/wiki/DirState. | |
11 It is made of a flat sequence of unordered variable-size entries, | |
12 so accessing any information in it requires parsing all of it. | |
13 Similarly, saving changes requires rewriting the entire file. | |
14 | |
15 The newer `dirsate-v2` file format is designed to fix these limitations | |
16 and make `hg status` faster. | |
17 | |
18 User guide | |
19 ========== | |
20 | |
21 Compatibility | |
22 ------------- | |
23 | |
24 The file format is experimental and may still change. | |
25 Different versions of Mercurial may not be compatible with each other | |
26 when working on a local repository that uses this format. | |
27 When using an incompatible version with the experimental format, | |
28 anything can happen including data corruption. | |
29 | |
30 Since the dirstate is entirely local and not relevant to the wire protocol, | |
31 `dirstate-v2` does not affect compatibility with remote Mercurial versions. | |
32 | |
33 When `share-safe` is enabled, different repositories sharing the same store | |
34 can use different dirstate formats. | |
35 | |
36 Enabling `dirsate-v2` for new local repositories | |
37 ------------------------------------------------ | |
38 | |
39 When creating a new local repository such as with `hg init` or `hg clone`, | |
40 the `exp-dirstate-v2` boolean in the `format` configuration section | |
41 controls whether to use this file format. | |
42 This is disabled by default as of this writing. | |
43 To enable it for a single repository, run for example:: | |
44 | |
45 $ hg init my-project --config format.exp-dirstate-v2=1 | |
46 | |
47 Checking the format of an existing local repsitory | |
48 -------------------------------------------------- | |
49 | |
50 The `debugformat` commands prints information about | |
51 which of multiple optional formats are used in the current repository, | |
52 including `dirstate-v2`:: | |
53 | |
54 $ hg debugformat | |
55 format-variant repo | |
56 fncache: yes | |
57 dirstate-v2: yes | |
58 […] | |
59 | |
60 Upgrading or downgrading an existing local repository | |
61 ----------------------------------------------------- | |
62 | |
63 The `debugupgrade` command does various upgrades or downgrades | |
64 on a local repository | |
65 based on the current Mercurial version and on configuration. | |
66 The same `format.exp-dirstate-v2` configuration is used again. | |
67 | |
68 Example to upgrade:: | |
69 | |
70 $ hg debugupgrade --config format.exp-dirstate-v2=1 | |
71 | |
72 Example to downgrade to `dirstate-v1`:: | |
73 | |
74 $ hg debugupgrade --config format.exp-dirstate-v2=0 | |
75 | |
76 Both of this commands do nothing but print a list of proposed changes, | |
77 which may include changes unrelated to the dirstate. | |
78 Those other changes are controlled by their own configuration keys. | |
79 Add `--run` to a command to actually apply the proposed changes. | |
80 | |
81 Backups of `.hg/requires` and `.hg/dirstate` are created | |
82 in a `.hg/upgradebackup.*` directory. | |
83 If something goes wrong, restoring those files should undo the change. | |
84 | |
85 Note that upgrading affects compatibility with older versions of Mercurial | |
86 as noted above. | |
87 This can be relevant when a repository’s files are on a USB drive | |
88 or some other removable media, or shared over the network, etc. | |
89 | |
90 Internal filesystem representation | |
91 ================================== | |
92 | |
93 Requirements file | |
94 ----------------- | |
95 | |
96 The `.hg/requires` file indicates which of various optional file formats | |
97 are used by a given repository. | |
98 Mercurial aborts when seeing a requirement it does not know about, | |
99 which avoids older version accidentally messing up a respository | |
100 that uses a format that was introduced later. | |
101 For versions that do support a format, the presence or absence of | |
102 the corresponding requirement indicates whether to use that format. | |
103 | |
104 When the file contains a `exp-dirstate-v2` line, | |
105 the `dirstate-v2` format is used. | |
106 With no such line `dirstate-v1` is used. | |
107 | |
108 High level description | |
109 ---------------------- | |
110 | |
111 Whereas `dirstate-v1` uses a single `.hg/disrtate` file, | |
112 in `dirstate-v2` that file is a "docket" file | |
113 that only contains some metadata | |
114 and points to separate data file named `.hg/dirstate.{ID}`, | |
115 where `{ID}` is a random identifier. | |
116 | |
117 This separation allows making data files append-only | |
118 and therefore safer to memory-map. | |
119 Creating a new data file (occasionally to clean up unused data) | |
120 can be done with a different ID | |
121 without disrupting another Mercurial process | |
122 that could still be using the previous data file. | |
123 | |
124 Both files have a format designed to reduce the need for parsing, | |
125 by using fixed-size binary components as much as possible. | |
126 For data that is not fixed-size, | |
127 references to other parts of a file can be made by storing "pseudo-pointers": | |
128 integers counted in bytes from the start of a file. | |
129 For read-only access no data structure is needed, | |
130 only a bytes buffer (possibly memory-mapped directly from the filesystem) | |
131 with specific parts read on demand. | |
132 | |
133 The data file contains "nodes" organized in a tree. | |
134 Each node represents a file or directory inside the working directory | |
135 or its parent changeset. | |
136 This tree has the same structure as the filesystem, | |
137 so a node representing a directory has child nodes representing | |
138 the files and subdirectories contained directly in that directory. | |
139 | |
140 The docket file format | |
141 ---------------------- | |
142 | |
143 This is implemented in `rust/hg-core/src/dirstate_tree/on_disk.rs` | |
144 and `mercurial/dirstateutils/docket.py`. | |
145 | |
146 Components of the docket file are found at fixed offsets, | |
147 counted in bytes from the start of the file: | |
148 | |
149 * Offset 0: | |
150 The 12-bytes marker string "dirstate-v2\n" ending with a newline character. | |
151 This makes it easier to tell a dirstate-v2 file from a dirstate-v1 file, | |
152 although it is not strictly necessary | |
153 since `.hg/requires` determines which format to use. | |
154 | |
155 * Offset 12: | |
156 The changeset node ID on the first parent of the working directory, | |
157 as up to 32 binary bytes. | |
158 If a node ID is shorter (20 bytes for SHA-1), | |
159 it is start-aligned and the rest of the bytes are set to zero. | |
160 | |
161 * Offset 44: | |
162 The changeset node ID on the second parent of the working directory, | |
163 or all zeros if there isn’t one. | |
164 Also 32 binary bytes. | |
165 | |
166 * Offset 76: | |
167 Tree metadata on 44 bytes, described below. | |
168 Its separation in this documentation from the rest of the docket | |
169 reflects a detail of the current implementation. | |
170 Since tree metadata is also made of fields at fixed offsets, those could | |
171 be inlined here by adding 76 bytes to each offset. | |
172 | |
173 * Offset 120: | |
174 The used size of the data file, as a 32-bit big-endian integer. | |
175 The actual size of the data file may be larger | |
176 (if another Mercurial processis in appending to it | |
177 but has not updated the docket yet). | |
178 That extra data must be ignored. | |
179 | |
180 * Offset 124: | |
181 The length of the data file identifier, as a 8-bit integer. | |
182 | |
183 * Offset 125: | |
184 The data file identifier. | |
185 | |
186 * Any additional data is current ignored, and dropped when updating the file. | |
187 | |
188 Tree metadata in the docket file | |
189 -------------------------------- | |
190 | |
191 Tree metadata is similarly made of components at fixed offsets. | |
192 These offsets are counted in bytes from the start of tree metadata, | |
193 which is 76 bytes after the start of the docket file. | |
194 | |
195 This metadata can be thought of as the singular root of the tree | |
196 formed by nodes in the data file. | |
197 | |
198 * Offset 0: | |
199 Pseudo-pointer to the start of root nodes, | |
200 counted in bytes from the start of the data file, | |
201 as a 32-bit big-endian integer. | |
202 These nodes describe files and directories found directly | |
203 at the root of the working directory. | |
204 | |
205 * Offset 4: | |
206 Number of root nodes, as a 32-bit big-endian integer. | |
207 | |
208 * Offset 8: | |
209 Total number of nodes in the entire tree that "have a dirstate entry", | |
210 as a 32-bit big-endian integer. | |
211 Those nodes represent files that would be present at all in `dirstate-v1`. | |
212 This is typically less than the total number of nodes. | |
213 This counter is used to implement `len(dirstatemap)`. | |
214 | |
215 * Offset 12: | |
216 Number of nodes in the entire tree that have a copy source, | |
217 as a 32-bit big-endian integer. | |
218 At the next commit, these files are recorded | |
219 as having been copied or moved/renamed from that source. | |
220 (A move is recorded as a copy and separate removal of the source.) | |
221 This counter is used to implement `len(dirstatemap.copymap)`. | |
222 | |
223 * Offset 16: | |
224 An estimation of how many bytes of the data file | |
225 (within its used size) are unused, as a 32-bit big-endian integer. | |
226 When appending to an existing data file, | |
227 some existing nodes or paths can be unreachable from the new root | |
228 but they still take up space. | |
229 This counter is used to decide when to write a new data file from scratch | |
230 instead of appending to an existing one, | |
231 in order to get rid of that unreachable data | |
232 and avoid unbounded file size growth. | |
233 | |
234 * Offset 20: | |
235 These four bytes are currently ignored | |
236 and reset to zero when updating a docket file. | |
237 This is an attempt at forward compatibility: | |
238 future Mercurial versions could use this as a bit field | |
239 to indicate that a dirstate has additional data or constraints. | |
240 Finding a dirstate file with the relevant bit unset indicates that | |
241 it was written by a then-older version | |
242 which is not aware of that future change. | |
243 | |
244 * Offset 24: | |
245 Either 20 zero bytes, or a SHA-1 hash as 20 binary bytes. | |
246 When present, the hash is of ignore patterns | |
247 that were used for some previous run of the `status` algorithm. | |
248 | |
249 * (Offset 44: end of tree metadata) | |
250 | |
251 Optional hash of ignore patterns | |
252 -------------------------------- | |
253 | |
254 The implementation of `status` at `rust/hg-core/src/dirstate_tree/status.rs` | |
255 has been optimized such that its run time is dominated by calls | |
256 to `stat` for reading the filesystem metadata of a file or directory, | |
257 and to `readdir` for listing the contents of a directory. | |
258 In some cases the algorithm can skip calls to `readdir` | |
259 (saving significant time) | |
260 because the dirstate already contains enough of the relevant information | |
261 to build the correct `status` results. | |
262 | |
263 The default configuration of `hg status` is to list unknown files | |
264 but not ignored files. | |
265 In this case, it matters for the `readdir`-skipping optimization | |
266 if a given file used to be ignored but became unknown | |
267 because `.hgignore` changed. | |
268 To detect the possibility of such a change, | |
269 the tree metadata contains an optional hash of all ignore patterns. | |
270 | |
271 We define: | |
272 | |
273 * "Root" ignore files as: | |
274 | |
275 - `.hgignore` at the root of the repository if it exists | |
276 - And all files from `ui.ignore.*` config. | |
277 | |
278 This set of files is sorted by the string representation of their path. | |
279 | |
280 * The "expanded contents" of an ignore files is the byte string made | |
281 by the concatenation of its contents followed by the "expanded contents" | |
282 of other files included with `include:` or `subinclude:` directives, | |
283 in inclusion order. This definition is recursive, as included files can | |
284 themselves include more files. | |
285 | |
286 This hash is defined as the SHA-1 of the concatenation (in sorted | |
287 order) of the "expanded contents" of each "root" ignore file. | |
288 (Note that computing this does not require actually concatenating byte ranges into | |
289 contiguous memory. | |
290 Instead a SHA-1 hasher object can be created and fed separate byte ranges one by | |
291 one.) | |
292 | |
293 The data file format | |
294 -------------------- | |
295 | |
296 This is implemented in `rust/hg-core/src/dirstate_tree/on_disk.rs` | |
297 and `mercurial/dirstateutils/v2.py`. | |
298 | |
299 The data file contains two types of data: paths and nodes. | |
300 | |
301 Paths and nodes can be organized in any order in the file, except that sibling | |
302 nodes must be next to each other and sorted by their path. Contiguity lets | |
303 the parent refer to them all by their count with a single pseudo-pointer, | |
304 instead of storing one pseudo-pointer per child node. Sorting allows using | |
305 binary seach to find a child node with a given name in `O(log(n))` byte ranges | |
306 comparisons. | |
307 | |
308 The current implemention writes paths and child node before a given node | |
309 for ease of figuring out the value of pseudo-pointers by the time the are to be | |
310 written, but this is not an obligation and readers must not rely on it. | |
311 | |
312 A path is stored as a byte string anywhere in the file, without delimiter. | |
313 It is refered to by one or more node by a pseudo-pointer to its start, and its | |
314 length in bytes. Since there is no delimiter, | |
315 when a path is a substring of another the same bytes could be reused, | |
316 although the implementation does not exploit this as of this writing. | |
317 | |
318 A node is stored on 43 bytes with components at fixed offsets. Paths and | |
319 child nodes relevant to a node are stored externally and referenced though | |
320 pseudo-pointers. | |
321 | |
322 All integers are stored in big-endian. All pseudo-pointers are 32-bit integers | |
323 counting bytes from the start of the data file. Path lengths and positions | |
324 are 16-bit integers, also counted in bytes. | |
325 | |
326 Node components are: | |
327 | |
328 * Offset 0: | |
329 Pseudo-pointer to the full path of this node, | |
330 from the working directory root. | |
331 | |
332 * Offset 4: | |
333 Length of the full path. | |
334 | |
335 * Offset 6: | |
336 Position of the last `/` path separator within the full path, | |
337 in bytes from the start of the full path, | |
338 or zero if there isn’t one. | |
339 The part of the full path after this position is the "base name". | |
340 Since sibling nodes have the same parent, only their base name vary | |
341 and needs to be considered when doing binary search to find a given path. | |
342 | |
343 * Offset 8: | |
344 Pseudo-pointer to the "copy source" path for this node, | |
345 or zero if there is no copy source. | |
346 | |
347 * Offset 12: | |
348 Length of the copy source path, or zero if there isn’t one. | |
349 | |
350 * Offset 14: | |
351 Pseudo-pointer to the start of child nodes. | |
352 | |
353 * Offset 18: | |
354 Number of child nodes, as a 32-bit integer. | |
355 They occupy 43 times this number of bytes | |
356 (not counting space for paths, and further descendants). | |
357 | |
358 * Offset 22: | |
359 Number as a 32-bit integer of descendant nodes in this subtree, | |
360 not including this node itself, | |
361 that "have a dirstate entry". | |
362 Those nodes represent files that would be present at all in `dirstate-v1`. | |
363 This is typically less than the total number of descendants. | |
364 This counter is used to implement `has_dir`. | |
365 | |
366 * Offset 26: | |
367 Number as a 32-bit integer of descendant nodes in this subtree, | |
368 not including this node itself, | |
369 that represent files tracked in the working directory. | |
370 (For example, `hg rm` makes a file untracked.) | |
371 This counter is used to implement `has_tracked_dir`. | |
372 | |
373 * Offset 30 and more: | |
374 **TODO:** docs not written yet | |
375 as this part of the format might be changing soon. |