notes.txt
author mpm@selenic.com
Sun, 22 May 2005 08:13:38 -0800
changeset 137 b45b1b00fc9e
parent 59 2bff7c0ea1d3
child 260 d7ce76d82876
permissions -rw-r--r--
Merge from hgweb

Some notes about Mercurial's design

Revlogs:

The fundamental storage type in Mercurial is a "revlog". A revlog is
the set of all revisions to a file. Each revision is either stored
compressed in its entirety or as a compressed binary delta against the
previous version. The decision of when to store a full version is made
based on how much data would be needed to reconstruct the file. This
lets us ensure that we never need to read huge amounts of data to
reconstruct a file, regardless of how many revisions of it we store.

In fact, we should always be able to do it with a single read,
provided we know when and where to read. This is where the index comes
in. Each revlog has an index containing a special hash (nodeid) of the
text, hashes for its parents, and where and how much of the revlog
data we need to read to reconstruct it. Thus, with one read of the
index and one read of the data, we can reconstruct any version in time
proportional to the file size.

Similarly, revlogs and their indices are append-only. This means that
adding a new version is also O(1) seeks.

Generally revlogs are used to represent revisions of files, but they
also are used to represent manifests and changesets.

Manifests:

A manifest is simply a list of all files in a given revision of a
project along with the nodeids of the corresponding file revisions. So
grabbing a given version of the project means simply looking up its
manifest and reconstruction all the file revisions pointed to by it.

Changesets:

A changeset is a list of all files changed in a check-in along with a
change description and some metadata like user and date. It also
contains a nodeid to the relevent revision of the manifest. Changesets
and manifests are one-to-one, but contain different data for
convenience.

Nodeids:

Nodeids are unique ids that are used to represent the contents of a
file AND its position in the project history. That is, if you change a
file and then change it back, the result will have a different nodeid
because it has different history. This is accomplished by including
the parents of a given revision's nodeids with the revision's text
when calculating the hash.

Graph merging:

Nodeids are implemented as they are to simplify merging. Merging a
pair of directed acyclic graphs (aka "the family tree" of the file
history) requires some method of determining if nodes in different
graphs correspond. Simply comparing the contents of the node (by
comparing text of given revisions or their hashes) can get confused by
identical revisions in the tree.

The nodeid approach makes it trivial - the hash uniquely describes a
revision's contents and its graph position relative to the root, so
merge is simply checking whether each nodeid in graph A is in the hash
table of graph B. If not, we pull them in, adding them sequentially to
the revlog.

Graph resolving:

Mercurial does branching by copying (or COWing) a repository and thus
keeps everything nice and linear within a repository. However, when a
merge of repositories (a "pull") is done, we may often have two head
revisions in a given graph. To keep things simple, Mercurial forces
the head revisions to be merged.

It first finds the closest common ancestor of the two heads. If one is
a child of the other, it becomes the new head. Otherwise, we call out
to a user-specified 3-way merge tool.

Merging files, manifests, and changesets:

We begin by comparing changeset DAGs, pulling all nodes we don't have
in our DAG from the other repository. As we do so, we collect a list
of changed files to merge.

Then for each file, we perform a graph merge and resolve as above.
It's important to merge files using per-file DAGs rather than just
changeset level DAGs as this diagram illustrates: 

M   M1   M2

AB
 |`-------v     M2 clones M
aB       AB     file A is change in mainline
 |`---v  AB'    file B is changed in M2
 |   aB / |     M1 clones M
 |   ab/  |     M1 changes B
 |   ab'  |     M1 merges from M2, changes to B conflict
 |    |  A'B'   M2 changes A
  `---+--.|
      |  a'B'   M2 merges from mainline, changes to A conflict
      `--.|
         ???    depending on which ancestor we choose, we will have
	        to redo A hand-merge, B hand-merge, or both
                but if we look at the files independently, everything
		is fine

After we've merged files, we merge the manifest log DAG and resolve
additions and deletions. Then we are ready to resolve the changeset
DAG - if our merge required any changes (the new head is not a
decendent of our tip), we must create a new changeset describing all
of the changes needed to merge it into the tip.

Merge performance:

The I/O operations for performing a merge are O(changed files), not
O(total changes) and in many cases, we needn't even unpack the deltas
to add them to our repository (though this optimization isn't
necessary).

Rollback:

When performing a commit or a merge, we order things so that the
changeset entry gets added last. We keep a transaction log of the name
of each file touched and its length prior to the transaction. On
abort, we simply truncate each file to its prior length. This is one
of the nice properties of the append-only structure of the revlogs.

Remote access:

Mercurial currently supports pulling from "serverless" repositories.
Simply making the repo directory accessibly via the web and pointing
hg at it can accomplish a pull. This is relatively bandwidth efficient
but no effort has been spent on pipelining, so it won't work
especially well over LAN yet.

It's also quite amenable to rsync, if you don't mind keeping an intact
copy of the master around locally.

Also note the append-only and ordering properties of the commit
guarantee that readers will always see a repository in a consistent
state and no special locking is necessary. As there is generally only
one writer to an hg repository, there is in fact no exclusion
implemented yet. 


Some comparisons to git:

Most notably, Mercurial uses delta compression and repositories
created with it will grow much more slowly over time. This also allows
it to be much more bandwidth efficient. I expect repos sizes and sync
speeds to be similar to or better than BK, given the use of binary diffs.

Mercurial is roughly the same performance as git in some areas and is
faster in others as it keeps around more metadata. One example is
listing and retrieving past versions of a file, which it can do
without reading all the changesets. This metadata will also allow it
to perform better merges as described above.