comparison mercurial/helptext/internals/dirstate-v2.txt @ 48166:e8a576de703f

dirstate-v2: Add internal documentation It can be viewed by running `hg help internals.dirstate-v2` Since that command rewraps paragraphs, the source text is written with semantic line breaks: https://sembr.org/ Differential Revision: https://phab.mercurial-scm.org/D11546
author Simon Sapin <simon.sapin@octobus.net>
date Fri, 01 Oct 2021 12:17:09 +0200
parents
children eb8092f9304f
comparison
equal deleted inserted replaced
48165:d467e44f71d7 48166:e8a576de703f
1 The *dirstate* is what Mercurial uses internally to track
2 the state of files in the working directory,
3 such as set by commands like `hg add` and `hg rm`.
4 It also contains some cached data that help make `hg status` faster.
5 The name refers both to `.hg/dirstate` on the filesystem
6 and the corresponding data structure in memory while a Mercurial process
7 is running.
8
9 The original file format, retroactively dubbed `dirstate-v1`,
10 is described at https://www.mercurial-scm.org/wiki/DirState.
11 It is made of a flat sequence of unordered variable-size entries,
12 so accessing any information in it requires parsing all of it.
13 Similarly, saving changes requires rewriting the entire file.
14
15 The newer `dirsate-v2` file format is designed to fix these limitations
16 and make `hg status` faster.
17
18 User guide
19 ==========
20
21 Compatibility
22 -------------
23
24 The file format is experimental and may still change.
25 Different versions of Mercurial may not be compatible with each other
26 when working on a local repository that uses this format.
27 When using an incompatible version with the experimental format,
28 anything can happen including data corruption.
29
30 Since the dirstate is entirely local and not relevant to the wire protocol,
31 `dirstate-v2` does not affect compatibility with remote Mercurial versions.
32
33 When `share-safe` is enabled, different repositories sharing the same store
34 can use different dirstate formats.
35
36 Enabling `dirsate-v2` for new local repositories
37 ------------------------------------------------
38
39 When creating a new local repository such as with `hg init` or `hg clone`,
40 the `exp-dirstate-v2` boolean in the `format` configuration section
41 controls whether to use this file format.
42 This is disabled by default as of this writing.
43 To enable it for a single repository, run for example::
44
45 $ hg init my-project --config format.exp-dirstate-v2=1
46
47 Checking the format of an existing local repsitory
48 --------------------------------------------------
49
50 The `debugformat` commands prints information about
51 which of multiple optional formats are used in the current repository,
52 including `dirstate-v2`::
53
54 $ hg debugformat
55 format-variant repo
56 fncache: yes
57 dirstate-v2: yes
58 […]
59
60 Upgrading or downgrading an existing local repository
61 -----------------------------------------------------
62
63 The `debugupgrade` command does various upgrades or downgrades
64 on a local repository
65 based on the current Mercurial version and on configuration.
66 The same `format.exp-dirstate-v2` configuration is used again.
67
68 Example to upgrade::
69
70 $ hg debugupgrade --config format.exp-dirstate-v2=1
71
72 Example to downgrade to `dirstate-v1`::
73
74 $ hg debugupgrade --config format.exp-dirstate-v2=0
75
76 Both of this commands do nothing but print a list of proposed changes,
77 which may include changes unrelated to the dirstate.
78 Those other changes are controlled by their own configuration keys.
79 Add `--run` to a command to actually apply the proposed changes.
80
81 Backups of `.hg/requires` and `.hg/dirstate` are created
82 in a `.hg/upgradebackup.*` directory.
83 If something goes wrong, restoring those files should undo the change.
84
85 Note that upgrading affects compatibility with older versions of Mercurial
86 as noted above.
87 This can be relevant when a repository’s files are on a USB drive
88 or some other removable media, or shared over the network, etc.
89
90 Internal filesystem representation
91 ==================================
92
93 Requirements file
94 -----------------
95
96 The `.hg/requires` file indicates which of various optional file formats
97 are used by a given repository.
98 Mercurial aborts when seeing a requirement it does not know about,
99 which avoids older version accidentally messing up a respository
100 that uses a format that was introduced later.
101 For versions that do support a format, the presence or absence of
102 the corresponding requirement indicates whether to use that format.
103
104 When the file contains a `exp-dirstate-v2` line,
105 the `dirstate-v2` format is used.
106 With no such line `dirstate-v1` is used.
107
108 High level description
109 ----------------------
110
111 Whereas `dirstate-v1` uses a single `.hg/disrtate` file,
112 in `dirstate-v2` that file is a "docket" file
113 that only contains some metadata
114 and points to separate data file named `.hg/dirstate.{ID}`,
115 where `{ID}` is a random identifier.
116
117 This separation allows making data files append-only
118 and therefore safer to memory-map.
119 Creating a new data file (occasionally to clean up unused data)
120 can be done with a different ID
121 without disrupting another Mercurial process
122 that could still be using the previous data file.
123
124 Both files have a format designed to reduce the need for parsing,
125 by using fixed-size binary components as much as possible.
126 For data that is not fixed-size,
127 references to other parts of a file can be made by storing "pseudo-pointers":
128 integers counted in bytes from the start of a file.
129 For read-only access no data structure is needed,
130 only a bytes buffer (possibly memory-mapped directly from the filesystem)
131 with specific parts read on demand.
132
133 The data file contains "nodes" organized in a tree.
134 Each node represents a file or directory inside the working directory
135 or its parent changeset.
136 This tree has the same structure as the filesystem,
137 so a node representing a directory has child nodes representing
138 the files and subdirectories contained directly in that directory.
139
140 The docket file format
141 ----------------------
142
143 This is implemented in `rust/hg-core/src/dirstate_tree/on_disk.rs`
144 and `mercurial/dirstateutils/docket.py`.
145
146 Components of the docket file are found at fixed offsets,
147 counted in bytes from the start of the file:
148
149 * Offset 0:
150 The 12-bytes marker string "dirstate-v2\n" ending with a newline character.
151 This makes it easier to tell a dirstate-v2 file from a dirstate-v1 file,
152 although it is not strictly necessary
153 since `.hg/requires` determines which format to use.
154
155 * Offset 12:
156 The changeset node ID on the first parent of the working directory,
157 as up to 32 binary bytes.
158 If a node ID is shorter (20 bytes for SHA-1),
159 it is start-aligned and the rest of the bytes are set to zero.
160
161 * Offset 44:
162 The changeset node ID on the second parent of the working directory,
163 or all zeros if there isn’t one.
164 Also 32 binary bytes.
165
166 * Offset 76:
167 Tree metadata on 44 bytes, described below.
168 Its separation in this documentation from the rest of the docket
169 reflects a detail of the current implementation.
170 Since tree metadata is also made of fields at fixed offsets, those could
171 be inlined here by adding 76 bytes to each offset.
172
173 * Offset 120:
174 The used size of the data file, as a 32-bit big-endian integer.
175 The actual size of the data file may be larger
176 (if another Mercurial processis in appending to it
177 but has not updated the docket yet).
178 That extra data must be ignored.
179
180 * Offset 124:
181 The length of the data file identifier, as a 8-bit integer.
182
183 * Offset 125:
184 The data file identifier.
185
186 * Any additional data is current ignored, and dropped when updating the file.
187
188 Tree metadata in the docket file
189 --------------------------------
190
191 Tree metadata is similarly made of components at fixed offsets.
192 These offsets are counted in bytes from the start of tree metadata,
193 which is 76 bytes after the start of the docket file.
194
195 This metadata can be thought of as the singular root of the tree
196 formed by nodes in the data file.
197
198 * Offset 0:
199 Pseudo-pointer to the start of root nodes,
200 counted in bytes from the start of the data file,
201 as a 32-bit big-endian integer.
202 These nodes describe files and directories found directly
203 at the root of the working directory.
204
205 * Offset 4:
206 Number of root nodes, as a 32-bit big-endian integer.
207
208 * Offset 8:
209 Total number of nodes in the entire tree that "have a dirstate entry",
210 as a 32-bit big-endian integer.
211 Those nodes represent files that would be present at all in `dirstate-v1`.
212 This is typically less than the total number of nodes.
213 This counter is used to implement `len(dirstatemap)`.
214
215 * Offset 12:
216 Number of nodes in the entire tree that have a copy source,
217 as a 32-bit big-endian integer.
218 At the next commit, these files are recorded
219 as having been copied or moved/renamed from that source.
220 (A move is recorded as a copy and separate removal of the source.)
221 This counter is used to implement `len(dirstatemap.copymap)`.
222
223 * Offset 16:
224 An estimation of how many bytes of the data file
225 (within its used size) are unused, as a 32-bit big-endian integer.
226 When appending to an existing data file,
227 some existing nodes or paths can be unreachable from the new root
228 but they still take up space.
229 This counter is used to decide when to write a new data file from scratch
230 instead of appending to an existing one,
231 in order to get rid of that unreachable data
232 and avoid unbounded file size growth.
233
234 * Offset 20:
235 These four bytes are currently ignored
236 and reset to zero when updating a docket file.
237 This is an attempt at forward compatibility:
238 future Mercurial versions could use this as a bit field
239 to indicate that a dirstate has additional data or constraints.
240 Finding a dirstate file with the relevant bit unset indicates that
241 it was written by a then-older version
242 which is not aware of that future change.
243
244 * Offset 24:
245 Either 20 zero bytes, or a SHA-1 hash as 20 binary bytes.
246 When present, the hash is of ignore patterns
247 that were used for some previous run of the `status` algorithm.
248
249 * (Offset 44: end of tree metadata)
250
251 Optional hash of ignore patterns
252 --------------------------------
253
254 The implementation of `status` at `rust/hg-core/src/dirstate_tree/status.rs`
255 has been optimized such that its run time is dominated by calls
256 to `stat` for reading the filesystem metadata of a file or directory,
257 and to `readdir` for listing the contents of a directory.
258 In some cases the algorithm can skip calls to `readdir`
259 (saving significant time)
260 because the dirstate already contains enough of the relevant information
261 to build the correct `status` results.
262
263 The default configuration of `hg status` is to list unknown files
264 but not ignored files.
265 In this case, it matters for the `readdir`-skipping optimization
266 if a given file used to be ignored but became unknown
267 because `.hgignore` changed.
268 To detect the possibility of such a change,
269 the tree metadata contains an optional hash of all ignore patterns.
270
271 We define:
272
273 * "Root" ignore files as:
274
275 - `.hgignore` at the root of the repository if it exists
276 - And all files from `ui.ignore.*` config.
277
278 This set of files is sorted by the string representation of their path.
279
280 * The "expanded contents" of an ignore files is the byte string made
281 by the concatenation of its contents followed by the "expanded contents"
282 of other files included with `include:` or `subinclude:` directives,
283 in inclusion order. This definition is recursive, as included files can
284 themselves include more files.
285
286 This hash is defined as the SHA-1 of the concatenation (in sorted
287 order) of the "expanded contents" of each "root" ignore file.
288 (Note that computing this does not require actually concatenating byte ranges into
289 contiguous memory.
290 Instead a SHA-1 hasher object can be created and fed separate byte ranges one by
291 one.)
292
293 The data file format
294 --------------------
295
296 This is implemented in `rust/hg-core/src/dirstate_tree/on_disk.rs`
297 and `mercurial/dirstateutils/v2.py`.
298
299 The data file contains two types of data: paths and nodes.
300
301 Paths and nodes can be organized in any order in the file, except that sibling
302 nodes must be next to each other and sorted by their path. Contiguity lets
303 the parent refer to them all by their count with a single pseudo-pointer,
304 instead of storing one pseudo-pointer per child node. Sorting allows using
305 binary seach to find a child node with a given name in `O(log(n))` byte ranges
306 comparisons.
307
308 The current implemention writes paths and child node before a given node
309 for ease of figuring out the value of pseudo-pointers by the time the are to be
310 written, but this is not an obligation and readers must not rely on it.
311
312 A path is stored as a byte string anywhere in the file, without delimiter.
313 It is refered to by one or more node by a pseudo-pointer to its start, and its
314 length in bytes. Since there is no delimiter,
315 when a path is a substring of another the same bytes could be reused,
316 although the implementation does not exploit this as of this writing.
317
318 A node is stored on 43 bytes with components at fixed offsets. Paths and
319 child nodes relevant to a node are stored externally and referenced though
320 pseudo-pointers.
321
322 All integers are stored in big-endian. All pseudo-pointers are 32-bit integers
323 counting bytes from the start of the data file. Path lengths and positions
324 are 16-bit integers, also counted in bytes.
325
326 Node components are:
327
328 * Offset 0:
329 Pseudo-pointer to the full path of this node,
330 from the working directory root.
331
332 * Offset 4:
333 Length of the full path.
334
335 * Offset 6:
336 Position of the last `/` path separator within the full path,
337 in bytes from the start of the full path,
338 or zero if there isn’t one.
339 The part of the full path after this position is the "base name".
340 Since sibling nodes have the same parent, only their base name vary
341 and needs to be considered when doing binary search to find a given path.
342
343 * Offset 8:
344 Pseudo-pointer to the "copy source" path for this node,
345 or zero if there is no copy source.
346
347 * Offset 12:
348 Length of the copy source path, or zero if there isn’t one.
349
350 * Offset 14:
351 Pseudo-pointer to the start of child nodes.
352
353 * Offset 18:
354 Number of child nodes, as a 32-bit integer.
355 They occupy 43 times this number of bytes
356 (not counting space for paths, and further descendants).
357
358 * Offset 22:
359 Number as a 32-bit integer of descendant nodes in this subtree,
360 not including this node itself,
361 that "have a dirstate entry".
362 Those nodes represent files that would be present at all in `dirstate-v1`.
363 This is typically less than the total number of descendants.
364 This counter is used to implement `has_dir`.
365
366 * Offset 26:
367 Number as a 32-bit integer of descendant nodes in this subtree,
368 not including this node itself,
369 that represent files tracked in the working directory.
370 (For example, `hg rm` makes a file untracked.)
371 This counter is used to implement `has_tracked_dir`.
372
373 * Offset 30 and more:
374 **TODO:** docs not written yet
375 as this part of the format might be changing soon.