1 Some notes about Mercurial's design |
|
2 |
|
3 Revlogs: |
|
4 |
|
5 The fundamental storage type in Mercurial is a "revlog". A revlog is |
|
6 the set of all revisions to a file. Each revision is either stored |
|
7 compressed in its entirety or as a compressed binary delta against the |
|
8 previous version. The decision of when to store a full version is made |
|
9 based on how much data would be needed to reconstruct the file. This |
|
10 lets us ensure that we never need to read huge amounts of data to |
|
11 reconstruct a file, regardless of how many revisions of it we store. |
|
12 |
|
13 In fact, we should always be able to do it with a single read, |
|
14 provided we know when and where to read. This is where the index comes |
|
15 in. Each revlog has an index containing a special hash (nodeid) of the |
|
16 text, hashes for its parents, and where and how much of the revlog |
|
17 data we need to read to reconstruct it. Thus, with one read of the |
|
18 index and one read of the data, we can reconstruct any version in time |
|
19 proportional to the file size. |
|
20 |
|
21 Similarly, revlogs and their indices are append-only. This means that |
|
22 adding a new version is also O(1) seeks. |
|
23 |
|
24 Generally revlogs are used to represent revisions of files, but they |
|
25 also are used to represent manifests and changesets. |
|
26 |
|
27 Manifests: |
|
28 |
|
29 A manifest is simply a list of all files in a given revision of a |
|
30 project along with the nodeids of the corresponding file revisions. So |
|
31 grabbing a given version of the project means simply looking up its |
|
32 manifest and reconstruction all the file revisions pointed to by it. |
|
33 |
|
34 Changesets: |
|
35 |
|
36 A changeset is a list of all files changed in a check-in along with a |
|
37 change description and some metadata like user and date. It also |
|
38 contains a nodeid to the relevent revision of the manifest. Changesets |
|
39 and manifests are one-to-one, but contain different data for |
|
40 convenience. |
|
41 |
|
42 Nodeids: |
|
43 |
|
44 Nodeids are unique ids that are used to represent the contents of a |
|
45 file AND its position in the project history. That is, if you change a |
|
46 file and then change it back, the result will have a different nodeid |
|
47 because it has different history. This is accomplished by including |
|
48 the parents of a given revision's nodeids with the revision's text |
|
49 when calculating the hash. |
|
50 |
|
51 Graph merging: |
|
52 |
|
53 Nodeids are implemented as they are to simplify merging. Merging a |
|
54 pair of directed acyclic graphs (aka "the family tree" of the file |
|
55 history) requires some method of determining if nodes in different |
|
56 graphs correspond. Simply comparing the contents of the node (by |
|
57 comparing text of given revisions or their hashes) can get confused by |
|
58 identical revisions in the tree. |
|
59 |
|
60 The nodeid approach makes it trivial - the hash uniquely describes a |
|
61 revision's contents and its graph position relative to the root, so |
|
62 merge is simply checking whether each nodeid in graph A is in the hash |
|
63 table of graph B. If not, we pull them in, adding them sequentially to |
|
64 the revlog. |
|
65 |
|
66 Branching and merging: |
|
67 |
|
68 Everything in Mercurial is potentially a branch and every user |
|
69 effectively works in their own branch. When you do a checkout, |
|
70 Mercurial remembers what the parent changeset was and uses it for the |
|
71 next check in. |
|
72 |
|
73 To do a merge of branches in Mercurial, you check out the heads of the |
|
74 two branches into the same working directory which causes a merge to |
|
75 be performed, and then check in the result once you're happy with it. |
|
76 The resulting checkin will have two parents. |
|
77 |
|
78 It decides when a merge is necessary by first determining if there are |
|
79 any uncommitted changes in the working directory. This effectively |
|
80 makes the working directory a branch off the checked in version it's |
|
81 based on. Then it also determines if the working directory is a direct |
|
82 ancestor or descendent of the second version we're attempting to |
|
83 checkout. If neither is true, we simply replace the working directory |
|
84 version with the new version. Otherwise we perform a merge between the |
|
85 two versions. |
|
86 |
|
87 Merging files and manifests: |
|
88 |
|
89 We begin by comparing two versions manifests and deciding which files |
|
90 need to be added, deleted, and merged. |
|
91 |
|
92 Then for each file, we perform a graph merge and resolve as above. |
|
93 It's important to merge files using per-file DAGs rather than just |
|
94 changeset level DAGs as this diagram illustrates: |
|
95 |
|
96 M M1 M2 |
|
97 |
|
98 AB |
|
99 |`-------v M2 clones M |
|
100 aB AB file A is change in mainline |
|
101 |`---v AB' file B is changed in M2 |
|
102 | aB / | M1 clones M |
|
103 | ab/ | M1 changes B |
|
104 | ab' | M1 merges from M2, changes to B conflict |
|
105 | | A'B' M2 changes A |
|
106 `---+--.| |
|
107 | a'B' M2 merges from mainline, changes to A conflict |
|
108 `--.| |
|
109 ??? depending on which ancestor we choose, we will have |
|
110 to redo A hand-merge, B hand-merge, or both |
|
111 but if we look at the files independently, everything |
|
112 is fine |
|
113 |
|
114 The result is a merged version in the working directory, waiting for |
|
115 check-in. |
|
116 |
|
117 Rollback: |
|
118 |
|
119 When performing a commit or a merge, we order things so that the |
|
120 changeset entry gets added last. We keep a transaction log of the name |
|
121 of each file touched and its length prior to the transaction. On |
|
122 abort, we simply truncate each file to its prior length. This is one |
|
123 of the nice properties of the append-only structure of the revlogs. |
|
124 We can also reuse this journal for "rollback". |
|
125 |
|
126 Merging between repositories: |
|
127 |
|
128 One of the key features of Mercurial is the ability to merge between |
|
129 independent repositories in a decentralized fashion. Each repository |
|
130 can act as a read-only server or a client. Clients operating by |
|
131 pulling all branches that it hasn't seen from the server and adding |
|
132 them into its graph. This is done in two steps: searching for new |
|
133 "roots" and pulling a "changegroup" |
|
134 |
|
135 Searching for new "roots" begins by finding all new heads and |
|
136 searching backwards from those heads to the first unknown nodes in |
|
137 their respective branches. These nodes are the 'roots' that are used |
|
138 to calculate the 'changegroup': the set of all changesets starting at |
|
139 those roots. Mercurial takes pains to make this search efficient in |
|
140 both bandwidth and round-trips. |
|
141 |
|
142 Once the roots are found, the changegroup can be transferred as a |
|
143 single streaming transfer. This is organized as an ordered set of |
|
144 deltas for changesets, manifests, and files. Large chunks of deltas |
|
145 can be directly added to the repository without unpacking so it's |
|
146 fairly fast. |
|