diff -r 000000000000 -r 9117c6561b0b notes.txt --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/notes.txt Tue May 03 13:16:10 2005 -0800 @@ -0,0 +1,159 @@ +Some notes about Mercurial's design + +Revlogs: + +The fundamental storage type in Mercurial is a "revlog". A revlog is +the set of all revisions to a file. Each revision is either stored +compressed in its entirety or as a compressed binary delta against the +previous version. The decision of when to store a full version is made +based on how much data would be needed to reconstruct the file. This +lets us ensure that we never need to read huge amounts of data to +reconstruct a file, regardless of how many revisions of it we store. + +In fact, we should always be able to do it with a single read, +provided we know when and where to read. This is where the index comes +in. Each revlog has an index containing a special hash (nodeid) of the +text, hashes for its parents, and where and how much of the revlog +data we need to read to reconstruct it. Thus, with one read of the +index and one read of the data, we can reconstruct any version in time +proportional to the file size. + +Similarly, revlogs and their indices are append-only. This means that +adding a new version is also O(1) seeks. + +Generally revlogs are used to represent revisions of files, but they +also are used to represent manifests and changesets. + +Manifests: + +A manifest is simply a list of all files in a given revision of a +project along with the nodeids of the corresponding file revisions. So +grabbing a given version of the project means simply looking up its +manifest and reconstruction all the file revisions pointed to by it. + +Changesets: + +A changeset is a list of all files changed in a check-in along with a +change description and some metadata like user and date. It also +contains a nodeid to the relevent revision of the manifest. Changesets +and manifests are one-to-one, but contain different data for +convenience. + +Nodeids: + +Nodeids are unique ids that are used to represent the contents of a +file AND its position in the project history. That is, if you change a +file and then change it back, the result will have a different nodeid +because it has different history. This is accomplished by including +the parents of a given revision's nodeids with the revision's text +when calculating the hash. + +Graph merging: + +Nodeids are implemented as they are to simplify merging. Merging a +pair of directed acyclic graphs (aka "the family tree" of the file +history) requires some method of determining if nodes in different +graphs correspond. Simply comparing the contents of the node (by +comparing text of given revisions or their hashes) can get confused by +identical revisions in the tree. + +The nodeid approach makes it trivial - the hash uniquely describes a +revision's contents and its graph position relative to the root, so +merge is simply checking whether each nodeid in graph A is in the hash +table of graph B. If not, we pull them in, adding them sequentially to +the revlog. + +Graph resolving: + +Mercurial does branching by copying (or COWing) a repository and thus +keeps everything nice and linear within a repository. However, when a +merge of repositories (a "pull") is done, we may often have two head +revisions in a given graph. To keep things simple, Mercurial forces +the head revisions to be merged. + +It first finds the closest common ancestor of the two heads. If one is +a child of the other, it becomes the new head. Otherwise, we call out +to a user-specified 3-way merge tool. + +Merging files, manifests, and changesets: + +We begin by comparing changeset DAGs, pulling all nodes we don't have +in our DAG from the other repository. As we do so, we collect a list +of changed files to merge. + +Then for each file, we perform a graph merge and resolve as above. +It's important to merge files using per-file DAGs rather than just +changeset level DAGs as this diagram illustrates: + +M M1 M2 + +AB + |`-------v M2 clones M +aB AB file A is change in mainline + |`---v AB' file B is changed in M2 + | aB / | M1 clones M + | ab/ | M1 changes B + | ab' | M1 merges from M2, changes to B conflict + | | A'B' M2 changes A + `---+--.| + | a'B' M2 merges from mainline, changes to A conflict + `--.| + ??? depending on which ancestor we choose, we will have + to redo A hand-merge, B hand-merge, or both + but if we look at the files independently, everything + is fine + +After we've merged files, we merge the manifest log DAG and resolve +additions and deletions. Then we are ready to resolve the changeset +DAG - if our merge required any changes (the new head is not a +decendent of our tip), we must create a new changeset describing all +of the changes needed to merge it into the tip. + +Merge performance: + +The I/O operations for performing a merge are O(changed files), not +O(total changes) and in many cases, we needn't even unpack the deltas +to add them to our repository (though this optimization isn't +necessary). + +Rollback: + +Rollback is not yet implemented, but will be easy to add. When +performing a commit or a merge, we order things so that the changeset +entry gets added last. We keep a transaction log of the name of each +file and its length prior to the transaction. On abort, we simply +truncate each file to its prior length. This is one of the nice +properties of the append-only structure of the revlogs. + +Remote access: + +Mercurial currently supports pulling from "serverless" repositories. +Simply making the repo directory accessibly via the web and pointing +hg at it can accomplish a pull. This is relatively bandwidth efficient +but no effort has been spent on pipelining, so it won't work +especially well over LAN yet. + +It's also quite amenable to rsync, if you don't mind keeping an intact +copy of the master around locally. + +Also note the append-only and ordering properties of the commit +guarantee that readers will always see a repository in a consistent +state and no special locking is necessary. As there is generally only +one writer to an hg repository, there is in fact no exclusion +implemented yet. + + +Some comparisons to git: + +Most notably, Mercurial uses delta compression and repositories +created with it will grow much more slowly over time. This also allows +it to be much more bandwidth efficient. I expect repos sizes and sync +speeds to be similar to or better than BK, given the use of binary diffs. + +Mercurial is roughly the same performance as git and is faster in +others as it keeps around more metadata. One example is listing and +retrieving past versions of a file, which it can do without reading +all the changesets. This metadata will also allow it to perform better +merges as described above. + +