changing-files: rework the way we store changed files in side-data
We need to store new data so this is a good opportunity to rework this fully.
1) We directly store the list of affected files in the side data:
* This avoids having to fetch and parse the `files` list in the revision in
addition to the sidedata, making the data more self-sufficient.
* This works around situations where that `files` field contains wrong
information, and opens the way to other bug fixes (e.g.
issue6219).
* The format (fixed initial index, sorted files) allows for fast lookup of a
filename within the structure.
* This unifies the storage of affected files and of copy sources and
destinations, limiting the number of filenames stored redundantly.
* This prepares for the fact that we should drop the `files` field as soon as
we make any change affecting the revision schema.
* This relies on compression to avoid a significant increase in the size of
changelog.d. More testing on this will be done before we freeze the final format.
2) We can store additional data:
* The new "merged" field,
* A future "salvaged" set recording files that might have been deleted but
were still present in the final result.
Differential Revision: https://phab.mercurial-scm.org/D9090
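
As an illustration of the layout described above, here is a minimal standalone
sketch (not the code from this patch) that parses a block and looks up a single
filename by binary search over the sorted index. The constants mirror the ones
added to mercurial/metadata.py; the names `parse_block`, `lookup` and `SAMPLE`
are hypothetical, and `SAMPLE` is the 24-byte value expected by the
'copy a to j' test in tests/test-copies-in-changeset.t.

```python
import struct

# Values mirror the constants this change adds to mercurial/metadata.py.
ACTION_MASK = 0b11100
ADDED, MERGED, REMOVED, TOUCHED = 0b00100, 0b01000, 0b01100, 0b10100
COPIED_MASK = 0b11
COPIED_FROM_P1, COPIED_FROM_P2 = 0b10, 0b11

HEADER = struct.Struct(">L")   # number of index entries
ENTRY = struct.Struct(">bLL")  # flag, filename end offset, copy-source index

# Expected side-data of the 'copy a to j' test (hypothetical variable name).
SAMPLE = (
    b"\x00\x00\x00\x02"                      # header: 2 entries
    b"\x00\x00\x00\x00\x01\x00\x00\x00\x00"  # "a": untouched copy source
    b"\x06\x00\x00\x00\x02\x00\x00\x00\x00"  # "j": added, copied from p1
    b"aj"                                    # data: concatenated filenames
)


def _entry(raw, i):
    """Unpack index entry `i` as (flag, filename_end, copy_source_index)."""
    return ENTRY.unpack_from(raw, HEADER.size + i * ENTRY.size)


def _name(raw, count, i):
    """Slice the filename of entry `i` out of the data section."""
    data_start = HEADER.size + ENTRY.size * count
    start = _entry(raw, i - 1)[1] if i else 0  # first filename starts at zero
    return raw[data_start + start:data_start + _entry(raw, i)[1]]


def parse_block(raw):
    """Return [(filename, action_bits, copy_bits, copy_source_index), ...]."""
    (count,) = HEADER.unpack_from(raw, 0)
    result = []
    for i in range(count):
        flag, _end, src = _entry(raw, i)
        result.append(
            (_name(raw, count, i), flag & ACTION_MASK, flag & COPIED_MASK, src)
        )
    return result


def lookup(raw, filename):
    """Binary-search the sorted index; return the flag byte or None."""
    (count,) = HEADER.unpack_from(raw, 0)
    lo, hi = 0, count
    while lo < hi:
        mid = (lo + hi) // 2
        name = _name(raw, count, mid)
        if name == filename:
            return _entry(raw, mid)[0]
        if name < filename:
            lo = mid + 1
        else:
            hi = mid
    return None
```

Because every index entry has the same size, `lookup` only unpacks the entries
it visits. On `SAMPLE`, `parse_block` reports "j" as added and copied from p1,
with index 0 ("a") as its copy source.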
--- a/mercurial/helptext/internals/revlogs.txt Mon Oct 05 10:33:52 2020 +0200
+++ b/mercurial/helptext/internals/revlogs.txt Tue Sep 15 10:55:17 2020 +0200
@@ -239,3 +239,75 @@
2. Hash the fulltext of the revision
The 20 byte node ids of the parents are fed into the hasher in ascending order.
+
+Changed Files side-data
+=======================
+
+(This feature is in active development and its behavior is not frozen yet. It
+should not be used in any production repository)
+
+When the `exp-copies-sidedata-changeset` requirement is in use, information
+related to the changed files will be stored as "side-data" for every changeset
+in the changelog.
+
+This data contains the following information:
+
+* set of files actively added by the changeset
+* set of files actively removed by the changeset
+* set of files actively merged by the changeset
+* set of files actively touched by the changeset
+* mapping of copy-source, copy-destination from first parent (p1)
+* mapping of copy-source, copy-destination from second parent (p2)
+
+The block itself is big-endian data, formatted in three sections: header, index,
+and data. See below for details:
+
+Header:
+
+ 4 bytes: unsigned integer
+
+ total number of entries in the index
+
+Index:
+
+ The index contains an entry for every involved filename. It is sorted by
+ filename. Each entry uses the following format:
+
+ 1 byte: bits field
+
+ This byte holds two different bit fields:
+
+ The 2 lower bits carry copy information:
+
+ `00`: file has no copy information,
+ `10`: file is copied from a p1 source,
+ `11`: file is copied from a p2 source.
+
+ The 3 next bits carry action information.
+
+ `000`: file was untouched, it exists in the index as a copy source,
+ `001`: file was actively added
+ `010`: file was actively merged
+ `011`: file was actively removed
+ `100`: reserved for future use
+ `101`: file was actively touched in any other way
+
+ (The 3 highest bits are unused)
+
+ 4 bytes: unsigned integer
+
+ Address (in bytes) of the end of the associated filename in the data
+ block. (This is the address of the first byte not part of the filename)
+
+ The start of the filename can be retrieved by reading that field for the
+ previous index entry. The filename of the first entry starts at zero.
+
+ 4 bytes: unsigned integer
+
+ Index (in this very index) of the source of the copy (when a copy is
+ happening). If no copy is happening the value of this field is
+ irrelevant and could have any value. It is set to zero by convention.
+
+Data:
+
+ raw bytes block containing all filenames concatenated without any separator.
--- a/mercurial/metadata.py Mon Oct 05 10:33:52 2020 +0200
+++ b/mercurial/metadata.py Tue Sep 15 10:55:17 2020 +0200
@@ -8,6 +8,7 @@
from __future__ import absolute_import, print_function
import multiprocessing
+import struct
from . import (
error,
@@ -373,54 +374,112 @@
return None
+# see mercurial/helptext/internals/revlogs.txt for details about the format
+
+ACTION_MASK = int("111" "00", 2)
+# note: an untouched file used as a copy source will have `000` for this mask.
+ADDED_FLAG = int("001" "00", 2)
+MERGED_FLAG = int("010" "00", 2)
+REMOVED_FLAG = int("011" "00", 2)
+# `100` is reserved for future use
+TOUCHED_FLAG = int("101" "00", 2)
+
+COPIED_MASK = int("11", 2)
+COPIED_FROM_P1_FLAG = int("10", 2)
+COPIED_FROM_P2_FLAG = int("11", 2)
+
+# structure is <flag><filename-end><copy-source>
+INDEX_HEADER = struct.Struct(">L")
+INDEX_ENTRY = struct.Struct(">bLL")
+
+
def encode_files_sidedata(files):
- sortedfiles = sorted(files.touched)
- sidedata = {}
- p1copies = files.copied_from_p1
- if p1copies:
- p1copies = encodecopies(sortedfiles, p1copies)
- sidedata[sidedatamod.SD_P1COPIES] = p1copies
- p2copies = files.copied_from_p2
- if p2copies:
- p2copies = encodecopies(sortedfiles, p2copies)
- sidedata[sidedatamod.SD_P2COPIES] = p2copies
- filesadded = files.added
- if filesadded:
- filesadded = encodefileindices(sortedfiles, filesadded)
- sidedata[sidedatamod.SD_FILESADDED] = filesadded
- filesremoved = files.removed
- if filesremoved:
- filesremoved = encodefileindices(sortedfiles, filesremoved)
- sidedata[sidedatamod.SD_FILESREMOVED] = filesremoved
- if not sidedata:
- sidedata = None
- return sidedata
+ all_files = set(files.touched)
+ all_files.update(files.copied_from_p1.values())
+ all_files.update(files.copied_from_p2.values())
+ all_files = sorted(all_files)
+ file_idx = {f: i for (i, f) in enumerate(all_files)}
+ file_idx[None] = 0
+
+ chunks = [INDEX_HEADER.pack(len(all_files))]
+
+ filename_length = 0
+ for f in all_files:
+ filename_size = len(f)
+ filename_length += filename_size
+ flag = 0
+ if f in files.added:
+ flag |= ADDED_FLAG
+ elif f in files.merged:
+ flag |= MERGED_FLAG
+ elif f in files.removed:
+ flag |= REMOVED_FLAG
+ elif f in files.touched:
+ flag |= TOUCHED_FLAG
+
+ copy = None
+ if f in files.copied_from_p1:
+ flag |= COPIED_FROM_P1_FLAG
+ copy = files.copied_from_p1.get(f)
+ elif f in files.copied_from_p2:
+ copy = files.copied_from_p2.get(f)
+ flag |= COPIED_FROM_P2_FLAG
+ copy_idx = file_idx[copy]
+ chunks.append(INDEX_ENTRY.pack(flag, filename_length, copy_idx))
+ chunks.extend(all_files)
+ return {sidedatamod.SD_FILES: b''.join(chunks)}
def decode_files_sidedata(changelogrevision, sidedata):
- """Return a ChangingFiles instance from a changelogrevision using sidata
- """
- touched = changelogrevision.files
+ md = ChangingFiles()
+ raw = sidedata.get(sidedatamod.SD_FILES)
+
+ if raw is None:
+ return md
+
+ copies = []
+ all_files = []
- rawindices = sidedata.get(sidedatamod.SD_FILESADDED)
- added = decodefileindices(touched, rawindices)
+ assert len(raw) >= INDEX_HEADER.size
+ total_files = INDEX_HEADER.unpack_from(raw, 0)[0]
- rawindices = sidedata.get(sidedatamod.SD_FILESREMOVED)
- removed = decodefileindices(touched, rawindices)
+ offset = INDEX_HEADER.size
+ file_offset_base = offset + (INDEX_ENTRY.size * total_files)
+ file_offset_last = file_offset_base
+
+ assert len(raw) >= file_offset_base
- rawcopies = sidedata.get(sidedatamod.SD_P1COPIES)
- p1_copies = decodecopies(touched, rawcopies)
-
- rawcopies = sidedata.get(sidedatamod.SD_P2COPIES)
- p2_copies = decodecopies(touched, rawcopies)
+ for idx in range(total_files):
+ flag, file_end, copy_idx = INDEX_ENTRY.unpack_from(raw, offset)
+ file_end += file_offset_base
+ filename = raw[file_offset_last:file_end]
+ filesize = file_end - file_offset_last
+ assert len(filename) == filesize
+ offset += INDEX_ENTRY.size
+ file_offset_last = file_end
+ all_files.append(filename)
+ if flag & ACTION_MASK == ADDED_FLAG:
+ md.mark_added(filename)
+ elif flag & ACTION_MASK == MERGED_FLAG:
+ md.mark_merged(filename)
+ elif flag & ACTION_MASK == REMOVED_FLAG:
+ md.mark_removed(filename)
+ elif flag & ACTION_MASK == TOUCHED_FLAG:
+ md.mark_touched(filename)
- return ChangingFiles(
- touched=touched,
- added=added,
- removed=removed,
- p1_copies=p1_copies,
- p2_copies=p2_copies,
- )
+ copied = None
+ if flag & COPIED_MASK == COPIED_FROM_P1_FLAG:
+ copied = md.mark_copied_from_p1
+ elif flag & COPIED_MASK == COPIED_FROM_P2_FLAG:
+ copied = md.mark_copied_from_p2
+
+ if copied is not None:
+ copies.append((copied, filename, copy_idx))
+
+ for copied, filename, copy_idx in copies:
+ copied(all_files[copy_idx], filename)
+
+ return md
def _getsidedata(srcrepo, rev):
@@ -428,23 +487,15 @@
filescopies = computechangesetcopies(ctx)
filesadded = computechangesetfilesadded(ctx)
filesremoved = computechangesetfilesremoved(ctx)
- sidedata = {}
- if any([filescopies, filesadded, filesremoved]):
- sortedfiles = sorted(ctx.files())
- p1copies, p2copies = filescopies
- p1copies = encodecopies(sortedfiles, p1copies)
- p2copies = encodecopies(sortedfiles, p2copies)
- filesadded = encodefileindices(sortedfiles, filesadded)
- filesremoved = encodefileindices(sortedfiles, filesremoved)
- if p1copies:
- sidedata[sidedatamod.SD_P1COPIES] = p1copies
- if p2copies:
- sidedata[sidedatamod.SD_P2COPIES] = p2copies
- if filesadded:
- sidedata[sidedatamod.SD_FILESADDED] = filesadded
- if filesremoved:
- sidedata[sidedatamod.SD_FILESREMOVED] = filesremoved
- return sidedata
+ filesmerged = computechangesetfilesmerged(ctx)
+ files = ChangingFiles()
+ files.update_touched(ctx.files())
+ files.update_added(filesadded)
+ files.update_removed(filesremoved)
+ files.update_merged(filesmerged)
+ files.update_copies_from_p1(filescopies[0])
+ files.update_copies_from_p2(filescopies[1])
+ return encode_files_sidedata(files)
def getsidedataadder(srcrepo, destrepo):
--- a/mercurial/revlogutils/sidedata.py Mon Oct 05 10:33:52 2020 +0200
+++ b/mercurial/revlogutils/sidedata.py Tue Sep 15 10:55:17 2020 +0200
@@ -53,6 +53,7 @@
SD_P2COPIES = 9
SD_FILESADDED = 10
SD_FILESREMOVED = 11
+SD_FILES = 12
# internal format constant
SIDEDATA_HEADER = struct.Struct('>H')
--- a/tests/test-copies-in-changeset.t Mon Oct 05 10:33:52 2020 +0200
+++ b/tests/test-copies-in-changeset.t Tue Sep 15 10:55:17 2020 +0200
@@ -79,11 +79,9 @@
2\x00a (esc)
#else
$ hg debugsidedata -c -v -- -1
- 2 sidedata entries
- entry-0010 size 11
- '0\x00a\n1\x00a\n2\x00a'
- entry-0012 size 5
- '0\n1\n2'
+ 1 sidedata entries
+ entry-0014 size 44
+ '\x00\x00\x00\x04\x00\x00\x00\x00\x01\x00\x00\x00\x00\x06\x00\x00\x00\x02\x00\x00\x00\x00\x06\x00\x00\x00\x03\x00\x00\x00\x00\x06\x00\x00\x00\x04\x00\x00\x00\x00abcd'
#endif
$ hg showcopies
@@ -117,13 +115,9 @@
#else
$ hg debugsidedata -c -v -- -1
- 3 sidedata entries
- entry-0010 size 3
- '1\x00b'
- entry-0012 size 1
- '1'
- entry-0013 size 1
- '0'
+ 1 sidedata entries
+ entry-0014 size 25
+ '\x00\x00\x00\x02\x0c\x00\x00\x00\x01\x00\x00\x00\x00\x06\x00\x00\x00\x03\x00\x00\x00\x00bb2'
#endif
$ hg showcopies
@@ -165,8 +159,8 @@
#else
$ hg debugsidedata -c -v -- -1
1 sidedata entries
- entry-0010 size 4
- '0\x00b2'
+ entry-0014 size 25
+ '\x00\x00\x00\x02\x00\x00\x00\x00\x02\x00\x00\x00\x00\x16\x00\x00\x00\x03\x00\x00\x00\x00b2c'
#endif
$ hg showcopies
@@ -221,13 +215,9 @@
#else
$ hg debugsidedata -c -v -- -1
- 3 sidedata entries
- entry-0010 size 7
- '0\x00a\n2\x00f'
- entry-0011 size 3
- '1\x00d'
- entry-0012 size 5
- '0\n1\n2'
+ 1 sidedata entries
+ entry-0014 size 64
+ '\x00\x00\x00\x06\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x06\x00\x00\x00\x04\x00\x00\x00\x00\x07\x00\x00\x00\x05\x00\x00\x00\x01\x06\x00\x00\x00\x06\x00\x00\x00\x02adfghi'
#endif
$ hg showcopies
@@ -250,11 +240,9 @@
#else
$ hg ci -m 'copy a to j'
$ hg debugsidedata -c -v -- -1
- 2 sidedata entries
- entry-0010 size 3
- '0\x00a'
- entry-0012 size 1
- '0'
+ 1 sidedata entries
+ entry-0014 size 24
+ '\x00\x00\x00\x02\x00\x00\x00\x00\x01\x00\x00\x00\x00\x06\x00\x00\x00\x02\x00\x00\x00\x00aj'
#endif
$ hg debugdata j 0
\x01 (esc)
@@ -281,11 +269,9 @@
$ hg ci --amend -m 'copy a to j, v2'
saved backup bundle to $TESTTMP/repo/.hg/strip-backup/*-*-amend.hg (glob)
$ hg debugsidedata -c -v -- -1
- 2 sidedata entries
- entry-0010 size 3
- '0\x00a'
- entry-0012 size 1
- '0'
+ 1 sidedata entries
+ entry-0014 size 24
+ '\x00\x00\x00\x02\x00\x00\x00\x00\x01\x00\x00\x00\x00\x06\x00\x00\x00\x02\x00\x00\x00\x00aj'
#endif
$ hg showcopies --config experimental.copies.read-from=filelog-only
a -> j
@@ -304,6 +290,9 @@
#else
$ hg ci -m 'modify j'
$ hg debugsidedata -c -v -- -1
+ 1 sidedata entries
+ entry-0014 size 14
+ '\x00\x00\x00\x01\x14\x00\x00\x00\x01\x00\x00\x00\x00j'
#endif
Test writing only to filelog
@@ -318,11 +307,9 @@
#else
$ hg ci -m 'copy a to k'
$ hg debugsidedata -c -v -- -1
- 2 sidedata entries
- entry-0010 size 3
- '0\x00a'
- entry-0012 size 1
- '0'
+ 1 sidedata entries
+ entry-0014 size 24
+ '\x00\x00\x00\x02\x00\x00\x00\x00\x01\x00\x00\x00\x00\x06\x00\x00\x00\x02\x00\x00\x00\x00ak'
#endif
$ hg debugdata k 0
@@ -439,10 +426,10 @@
compression-level: default default default
$ hg debugsidedata -c -- 0
1 sidedata entries
- entry-0012 size 1
+ entry-0014 size 14
$ hg debugsidedata -c -- 1
1 sidedata entries
- entry-0013 size 1
+ entry-0014 size 14
$ hg debugsidedata -m -- 0
$ cat << EOF > .hg/hgrc
> [format]
@@ -463,7 +450,11 @@
compression: zlib zlib zlib
compression-level: default default default
$ hg debugsidedata -c -- 0
+ 1 sidedata entries
+ entry-0014 size 14
$ hg debugsidedata -c -- 1
+ 1 sidedata entries
+ entry-0014 size 14
$ hg debugsidedata -m -- 0
upgrading
@@ -487,10 +478,10 @@
compression-level: default default default
$ hg debugsidedata -c -- 0
1 sidedata entries
- entry-0012 size 1
+ entry-0014 size 14
$ hg debugsidedata -c -- 1
1 sidedata entries
- entry-0013 size 1
+ entry-0014 size 14
$ hg debugsidedata -m -- 0
#endif