help: add documentation for changegroup formats
There is no formal location for spec-like technical/internal docs. The
repository makes sense as such a location because spec-like
documentation should be reviewed (ruling out a wiki). mpm has also
stated that he would like this documentation to be part of the
built-in help system. So, we establish an "internals" sub-directory
to hold this class of documentation.
The format of changegroups does not appear to be documented anywhere,
even in source code. It therefore seemed like an appropriate first thing
to document.
This patch adds low-level documentation of versions 1 and 2 of the
changegroup foromat. It currently only describes the raw data format.
There is probably room to write higher-level documentation on strategies
for producing and consuming the data. We'll leave that for another day.
The added file is not yet accessible via `hg help` nor via hgweb.
Support for this will follow in subsequent patches.
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/mercurial/help/internals/changegroups.txt Sun Oct 25 00:19:45 2015 +0100
@@ -0,0 +1,142 @@
+Changegroups
+============
+
+Changegroups are representations of repository revlog data, specifically
+the changelog, manifest, and filelogs.
+
+There are 2 versions of changegroups: ``1`` and ``2``. From a
+high-level, they are almost exactly the same, with the only difference
+being a header on entries in the changeset segment.
+
+Changegroups consists of 3 logical segments::
+
+ +---------------------------------+
+ | | | |
+ | changeset | manifest | filelogs |
+ | | | |
+ +---------------------------------+
+
+The principle building block of each segment is a *chunk*. A *chunk*
+is a framed piece of data::
+
+ +---------------------------------------+
+ | | |
+ | length | data |
+ | (32 bits) | <length> bytes |
+ | | |
+ +---------------------------------------+
+
+Each chunk starts with a 32-bit big-endian signed integer indicating
+the length of the raw data that follows.
+
+There is a special case chunk that has 0 length (``0x00000000``). We
+call this an *empty chunk*.
+
+Delta Groups
+------------
+
+A *delta group* expresses the content of a revlog as a series of deltas,
+or patches against previous revisions.
+
+Delta groups consist of 0 or more *chunks* followed by the *empty chunk*
+to signal the end of the delta group::
+
+ +------------------------------------------------------------------------+
+ | | | | | |
+ | chunk0 length | chunk0 data | chunk1 length | chunk1 data | 0x0 |
+ | (32 bits) | (various) | (32 bits) | (various) | (32 bits) |
+ | | | | | |
+ +------------------------------------------------------------+-----------+
+
+Each *chunk*'s data consists of the following::
+
+ +-----------------------------------------+
+ | | | |
+ | delta header | mdiff header | delta |
+ | (various) | (12 bytes) | (various) |
+ | | | |
+ +-----------------------------------------+
+
+The *length* field is the byte length of the remaining 3 logical pieces
+of data. The *delta* is a diff from an existing entry in the changelog.
+
+The *delta header* is different between versions ``1`` and ``2`` of the
+changegroup format.
+
+Version 1::
+
+ +------------------------------------------------------+
+ | | | | |
+ | node | p1 node | p2 node | link node |
+ | (20 bytes) | (20 bytes) | (20 bytes) | (20 bytes) |
+ | | | | |
+ +------------------------------------------------------+
+
+Version 2::
+
+ +------------------------------------------------------------------+
+ | | | | | |
+ | node | p1 node | p2 node | base node | link node |
+ | (20 bytes) | (20 bytes) | (20 bytes) | (20 bytes) | (20 bytes) |
+ | | | | | |
+ +------------------------------------------------------------------+
+
+The *mdiff header* consists of 3 32-bit big-endian signed integers
+describing offsets at which to apply the following delta content::
+
+ +-------------------------------------+
+ | | | |
+ | offset | old length | new length |
+ | (32 bits) | (32 bits) | (32 bits) |
+ | | | |
+ +-------------------------------------+
+
+In version 1, the delta is always applied against the previous node from
+the changegroup or the first parent if this is the first entry in the
+changegroup.
+
+In version 2, the delta base node is encoded in the entry in the
+changegroup. This allows the delta to be expressed against any parent,
+which can result in smaller deltas and more efficient encoding of data.
+
+Changeset Segment
+-----------------
+
+The *changeset segment* consists of a single *delta group* holding
+changelog data. It is followed by an *empty chunk* to denote the
+boundary to the *manifests segment*.
+
+Manifest Segment
+----------------
+
+The *manifest segment* consists of a single *delta group* holding
+manifest data. It is followed by an *empty chunk* to denote the boundary
+to the *filelogs segment*.
+
+Filelogs Segment
+----------------
+
+The *filelogs* segment consists of multiple sub-segments, each
+corresponding to an individual file whose data is being described::
+
+ +--------------------------------------+
+ | | | | |
+ | filelog0 | filelog1 | filelog2 | ... |
+ | | | | |
+ +--------------------------------------+
+
+The final filelog sub-segment is followed by an *empty chunk* to denote
+the end of the segment and the overall changegroup.
+
+Each filelog sub-segment consists of the following::
+
+ +------------------------------------------+
+ | | | |
+ | filename size | filename | delta group |
+ | (32 bits) | (various) | (various) |
+ | | | |
+ +------------------------------------------+
+
+That is, a *chunk* consisting of the filename (not terminated or padded)
+followed by N chunks constituting the *delta group* for this file.
+