changeset 45634:9a6b409b8ebc
changing-files: rework the way we store changed files in side-data
We need to store new data, so this is a good opportunity to rework this fully.
1) We directly store the list of affected files in the side data:
* This avoids having to fetch and parse the `files` list in the revision in
addition to the sidedata, making the data more self-sufficient.
* This works around situations where the `files` field contains wrong
information, and opens the way to other bug fixes (e.g. issue6219).
* The format (fixed initial index, sorted files) allows for fast lookup of
a filename within the structure.
* This unifies the storage of affected files and of copy sources and
destinations, limiting the number of filenames stored redundantly.
* This prepares for dropping the `files` field as soon as we make any
change affecting the revision schema.
* This relies on compression to avoid a significant increase in the size of
changelog.d. More testing on this will be done before we freeze the final format.
2) We can store additional data:
* The new "merged" field,
* A future "salvaged" set recording files that might have been deleted but
were still present in the final result.
Differential Revision: https://phab.mercurial-scm.org/D9090
author | Pierre-Yves David <pierre-yves.david@octobus.net> |
---|---|
date | Tue, 15 Sep 2020 10:55:17 +0200 |
parents | 7d0e54056586 |
children | 9003e6524f78 |
files | mercurial/helptext/internals/revlogs.txt mercurial/metadata.py mercurial/revlogutils/sidedata.py tests/test-copies-in-changeset.t |
diffstat | 4 files changed, 212 insertions(+), 97 deletions(-) |
--- a/mercurial/helptext/internals/revlogs.txt	Mon Oct 05 10:33:52 2020 +0200
+++ b/mercurial/helptext/internals/revlogs.txt	Tue Sep 15 10:55:17 2020 +0200
@@ -239,3 +239,75 @@
 2. Hash the fulltext of the revision
 
 The 20 byte node ids of the parents are fed into the hasher in ascending order.
+
+Changed Files side-data
+=======================
+
+(This feature is in active development and its behavior is not frozen yet. It
+should not be used in any production repository)
+
+When the `exp-copies-sidedata-changeset` requirement is in use, information
+related to the changed files will be stored as "side-data" for every changeset
+in the changelog.
+
+These data contains the following information:
+
+* set of files actively added by the changeset
+* set of files actively removed by the changeset
+* set of files actively merged by the changeset
+* set of files actively touched by he changeset
+* mapping of copy-source, copy-destination from first parent (p1)
+* mapping of copy-source, copy-destination from second parent (p2)
+
+The block itself is big-endian data, formatted in three sections: header, index,
+and data. See below for details:
+
+Header:
+
+    4 bytes: unsigned integer
+
+        total number of entry in the index
+
+Index:
+
+    The index contains an entry for every involved filename. It is sorted by
+    filename. The entry use the following format:
+
+    1 byte: bits field
+
+        This byte hold two different bit fields:
+
+        The 2 lower bits carry copy information:
+
+            `00`: file has not copy information,
+            `10`: file is copied from a p1 source,
+            `11`: file is copied from a p2 source.
+
+        The 3 next bits carry action information.
+
+            `000`: file was untouched, it exist in the index as copy source,
+            `001`: file was actively added
+            `010`: file was actively merged
+            `011`: file was actively removed
+            `100`: reserved for future use
+            `101`: file was actively touched in any other way
+
+        (The last 2 bites are unused)
+
+    4 bytes: unsigned integer
+
+        Address (in bytes) of the end of the associated filename in the data
+        block. (This is the address of the first byte not part of the filename)
+
+        The start of the filename can be retrieve by reading that field for the
+        previous index entry. The filename of the first entry starts at zero.
+
+    4 bytes: unsigned integer
+
+        Index (in this very index) of the source of the copy (when a copy is
+        happening). If no copy is happening the value of this field is
+        irrelevant and could have any value. It is set to zero by convention
+
+Data:
+
+    raw bytes block containing all filename concatenated without any separator.
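The header/index/data layout described in the revlogs.txt hunk can be illustrated with a small standalone decoder. This is only a sketch, not Mercurial's code: `parse_changed_files`, `ACTIONS`, `COPY_PARENT` and `SAMPLE` are hypothetical names, and the sketch reads the flag byte as unsigned (`B`), whereas the patch declares the entry as `>bLL`.

```python
import struct

HEADER = struct.Struct(">L")    # number of index entries
ENTRY = struct.Struct(">BLL")   # flag byte, filename end offset, copy-source index

# Action values live in bits 2-4 of the flag byte, copy info in bits 0-1.
ACTIONS = {0b000: "untouched", 0b001: "added", 0b010: "merged",
           0b011: "removed", 0b101: "touched"}
COPY_PARENT = {0b10: "p1", 0b11: "p2"}


def parse_changed_files(raw):
    """Decode one side-data block into a list of per-file dicts."""
    (count,) = HEADER.unpack_from(raw, 0)
    data_start = HEADER.size + ENTRY.size * count
    entries = []
    start = 0  # the first filename starts at offset zero of the data block
    for i in range(count):
        flag, end, copy_idx = ENTRY.unpack_from(raw, HEADER.size + i * ENTRY.size)
        name = raw[data_start + start:data_start + end]
        action = ACTIONS.get((flag >> 2) & 0b111, "reserved")
        entries.append((name, action, COPY_PARENT.get(flag & 0b11), copy_idx))
        start = end
    # Copy sources are resolved last, since they refer to index positions.
    return [
        {"name": name, "action": action, "copied_from": parent,
         "source": entries[copy_idx][0] if parent else None}
        for name, action, parent, copy_idx in entries
    ]


# Same bytes as one expected value in tests/test-copies-in-changeset.t.
SAMPLE = (b"\x00\x00\x00\x02"
          b"\x0c\x00\x00\x00\x01\x00\x00\x00\x00"
          b"\x06\x00\x00\x00\x03\x00\x00\x00\x00"
          b"bb2")

for entry in parse_changed_files(SAMPLE):
    print(entry)
```

Read this way, the sample describes a rename: `b` is removed, and `b2` is added as a copy of `b` from p1.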
--- a/mercurial/metadata.py	Mon Oct 05 10:33:52 2020 +0200
+++ b/mercurial/metadata.py	Tue Sep 15 10:55:17 2020 +0200
@@ -8,6 +8,7 @@
 from __future__ import absolute_import, print_function
 
 import multiprocessing
+import struct
 
 from . import (
     error,
@@ -373,54 +374,112 @@
     return None
 
 
+# see mercurial/helptext/internals/revlogs.txt for details about the format
+
+ACTION_MASK = int("111" "00", 2)
+# note: untouched file used as copy source will as `000` for this mask.
+ADDED_FLAG = int("001" "00", 2)
+MERGED_FLAG = int("010" "00", 2)
+REMOVED_FLAG = int("011" "00", 2)
+# `100` is reserved for future use
+TOUCHED_FLAG = int("101" "00", 2)
+
+COPIED_MASK = int("11", 2)
+COPIED_FROM_P1_FLAG = int("10", 2)
+COPIED_FROM_P2_FLAG = int("11", 2)
+
+# structure is <flag><filename-end><copy-source>
+INDEX_HEADER = struct.Struct(">L")
+INDEX_ENTRY = struct.Struct(">bLL")
+
+
 def encode_files_sidedata(files):
-    sortedfiles = sorted(files.touched)
-    sidedata = {}
-    p1copies = files.copied_from_p1
-    if p1copies:
-        p1copies = encodecopies(sortedfiles, p1copies)
-        sidedata[sidedatamod.SD_P1COPIES] = p1copies
-    p2copies = files.copied_from_p2
-    if p2copies:
-        p2copies = encodecopies(sortedfiles, p2copies)
-        sidedata[sidedatamod.SD_P2COPIES] = p2copies
-    filesadded = files.added
-    if filesadded:
-        filesadded = encodefileindices(sortedfiles, filesadded)
-        sidedata[sidedatamod.SD_FILESADDED] = filesadded
-    filesremoved = files.removed
-    if filesremoved:
-        filesremoved = encodefileindices(sortedfiles, filesremoved)
-        sidedata[sidedatamod.SD_FILESREMOVED] = filesremoved
-    if not sidedata:
-        sidedata = None
-    return sidedata
+    all_files = set(files.touched)
+    all_files.update(files.copied_from_p1.values())
+    all_files.update(files.copied_from_p2.values())
+    all_files = sorted(all_files)
+    file_idx = {f: i for (i, f) in enumerate(all_files)}
+    file_idx[None] = 0
+
+    chunks = [INDEX_HEADER.pack(len(all_files))]
+
+    filename_length = 0
+    for f in all_files:
+        filename_size = len(f)
+        filename_length += filename_size
+        flag = 0
+        if f in files.added:
+            flag |= ADDED_FLAG
+        elif f in files.merged:
+            flag |= MERGED_FLAG
+        elif f in files.removed:
+            flag |= REMOVED_FLAG
+        elif f in files.touched:
+            flag |= TOUCHED_FLAG
+
+        copy = None
+        if f in files.copied_from_p1:
+            flag |= COPIED_FROM_P1_FLAG
+            copy = files.copied_from_p1.get(f)
+        elif f in files.copied_from_p2:
+            copy = files.copied_from_p2.get(f)
+            flag |= COPIED_FROM_P2_FLAG
+        copy_idx = file_idx[copy]
+        chunks.append(INDEX_ENTRY.pack(flag, filename_length, copy_idx))
+    chunks.extend(all_files)
+    return {sidedatamod.SD_FILES: b''.join(chunks)}
 
 
 def decode_files_sidedata(changelogrevision, sidedata):
-    """Return a ChangingFiles instance from a changelogrevision using sidata
-    """
-    touched = changelogrevision.files
+    md = ChangingFiles()
+    raw = sidedata.get(sidedatamod.SD_FILES)
+
+    if raw is None:
+        return md
+
+    copies = []
+    all_files = []
 
-    rawindices = sidedata.get(sidedatamod.SD_FILESADDED)
-    added = decodefileindices(touched, rawindices)
+    assert len(raw) >= INDEX_HEADER.size
+    total_files = INDEX_HEADER.unpack_from(raw, 0)[0]
 
-    rawindices = sidedata.get(sidedatamod.SD_FILESREMOVED)
-    removed = decodefileindices(touched, rawindices)
+    offset = INDEX_HEADER.size
+    file_offset_base = offset + (INDEX_ENTRY.size * total_files)
+    file_offset_last = file_offset_base
+
+    assert len(raw) >= file_offset_base
 
-    rawcopies = sidedata.get(sidedatamod.SD_P1COPIES)
-    p1_copies = decodecopies(touched, rawcopies)
-
-    rawcopies = sidedata.get(sidedatamod.SD_P2COPIES)
-    p2_copies = decodecopies(touched, rawcopies)
+    for idx in range(total_files):
+        flag, file_end, copy_idx = INDEX_ENTRY.unpack_from(raw, offset)
+        file_end += file_offset_base
+        filename = raw[file_offset_last:file_end]
+        filesize = file_end - file_offset_last
+        assert len(filename) == filesize
+        offset += INDEX_ENTRY.size
+        file_offset_last = file_end
+        all_files.append(filename)
+        if flag & ACTION_MASK == ADDED_FLAG:
+            md.mark_added(filename)
+        elif flag & ACTION_MASK == MERGED_FLAG:
+            md.mark_merged(filename)
+        elif flag & ACTION_MASK == REMOVED_FLAG:
+            md.mark_removed(filename)
+        elif flag & ACTION_MASK == TOUCHED_FLAG:
+            md.mark_touched(filename)
 
-    return ChangingFiles(
-        touched=touched,
-        added=added,
-        removed=removed,
-        p1_copies=p1_copies,
-        p2_copies=p2_copies,
-    )
+        copied = None
+        if flag & COPIED_MASK == COPIED_FROM_P1_FLAG:
+            copied = md.mark_copied_from_p1
+        elif flag & COPIED_MASK == COPIED_FROM_P2_FLAG:
+            copied = md.mark_copied_from_p2
+
+        if copied is not None:
+            copies.append((copied, filename, copy_idx))
+
+    for copied, filename, copy_idx in copies:
+        copied(all_files[copy_idx], filename)
+
+    return md
 
 
 def _getsidedata(srcrepo, rev):
@@ -428,23 +487,15 @@
     filescopies = computechangesetcopies(ctx)
     filesadded = computechangesetfilesadded(ctx)
     filesremoved = computechangesetfilesremoved(ctx)
-    sidedata = {}
-    if any([filescopies, filesadded, filesremoved]):
-        sortedfiles = sorted(ctx.files())
-        p1copies, p2copies = filescopies
-        p1copies = encodecopies(sortedfiles, p1copies)
-        p2copies = encodecopies(sortedfiles, p2copies)
-        filesadded = encodefileindices(sortedfiles, filesadded)
-        filesremoved = encodefileindices(sortedfiles, filesremoved)
-        if p1copies:
-            sidedata[sidedatamod.SD_P1COPIES] = p1copies
-        if p2copies:
-            sidedata[sidedatamod.SD_P2COPIES] = p2copies
-        if filesadded:
-            sidedata[sidedatamod.SD_FILESADDED] = filesadded
-        if filesremoved:
-            sidedata[sidedatamod.SD_FILESREMOVED] = filesremoved
-    return sidedata
+    filesmerged = computechangesetfilesmerged(ctx)
+    files = ChangingFiles()
+    files.update_touched(ctx.files())
+    files.update_added(filesadded)
+    files.update_removed(filesremoved)
+    files.update_merged(filesmerged)
+    files.update_copies_from_p1(filescopies[0])
+    files.update_copies_from_p2(filescopies[1])
+    return encode_files_sidedata(files)
 
 
 def getsidedataadder(srcrepo, destrepo):
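The commit message states that the fixed-size index and sorted filenames allow fast lookup of a filename. The patch itself decodes every entry, so the following is only a sketch of how such a lookup might be done with a binary search over the raw block. `find_entry` and `_entry_at` are hypothetical helpers, and the flag byte is read as unsigned here.

```python
import struct

HEADER = struct.Struct(">L")    # number of index entries
ENTRY = struct.Struct(">BLL")   # flag byte, filename end offset, copy-source index


def _entry_at(raw, count, i):
    """Return (filename, flag, copy_idx) for index position i."""
    data_start = HEADER.size + ENTRY.size * count
    flag, end, copy_idx = ENTRY.unpack_from(raw, HEADER.size + i * ENTRY.size)
    start = 0
    if i > 0:
        # A filename starts where the previous entry's filename ends.
        start = ENTRY.unpack_from(raw, HEADER.size + (i - 1) * ENTRY.size)[1]
    return raw[data_start + start:data_start + end], flag, copy_idx


def find_entry(raw, filename):
    """Binary-search the sorted index; return (position, flag, copy_idx) or None."""
    (count,) = HEADER.unpack_from(raw, 0)
    lo, hi = 0, count
    while lo < hi:
        mid = (lo + hi) // 2
        name, flag, copy_idx = _entry_at(raw, count, mid)
        if name == filename:
            return mid, flag, copy_idx
        if name < filename:
            lo = mid + 1
        else:
            hi = mid
    return None
```

Each probe reads at most two fixed-size index entries and one filename slice, so a lookup touches O(log n) entries instead of decoding the whole block.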
--- a/mercurial/revlogutils/sidedata.py	Mon Oct 05 10:33:52 2020 +0200
+++ b/mercurial/revlogutils/sidedata.py	Tue Sep 15 10:55:17 2020 +0200
@@ -53,6 +53,7 @@
 SD_P2COPIES = 9
 SD_FILESADDED = 10
 SD_FILESREMOVED = 11
+SD_FILES = 12
 
 # internal format constant
 SIDEDATA_HEADER = struct.Struct('>H')
--- a/tests/test-copies-in-changeset.t	Mon Oct 05 10:33:52 2020 +0200
+++ b/tests/test-copies-in-changeset.t	Tue Sep 15 10:55:17 2020 +0200
@@ -79,11 +79,9 @@
   2\x00a (esc)
 #else
   $ hg debugsidedata -c -v -- -1
-  2 sidedata entries
-   entry-0010 size 11
-    '0\x00a\n1\x00a\n2\x00a'
-   entry-0012 size 5
-    '0\n1\n2'
+  1 sidedata entries
+   entry-0014 size 44
+    '\x00\x00\x00\x04\x00\x00\x00\x00\x01\x00\x00\x00\x00\x06\x00\x00\x00\x02\x00\x00\x00\x00\x06\x00\x00\x00\x03\x00\x00\x00\x00\x06\x00\x00\x00\x04\x00\x00\x00\x00abcd'
 #endif
 
   $ hg showcopies
@@ -117,13 +115,9 @@
 
 #else
   $ hg debugsidedata -c -v -- -1
-  3 sidedata entries
-   entry-0010 size 3
-    '1\x00b'
-   entry-0012 size 1
-    '1'
-   entry-0013 size 1
-    '0'
+  1 sidedata entries
+   entry-0014 size 25
+    '\x00\x00\x00\x02\x0c\x00\x00\x00\x01\x00\x00\x00\x00\x06\x00\x00\x00\x03\x00\x00\x00\x00bb2'
 #endif
 
   $ hg showcopies
@@ -165,8 +159,8 @@
 #else
   $ hg debugsidedata -c -v -- -1
   1 sidedata entries
-   entry-0010 size 4
-    '0\x00b2'
+   entry-0014 size 25
+    '\x00\x00\x00\x02\x00\x00\x00\x00\x02\x00\x00\x00\x00\x16\x00\x00\x00\x03\x00\x00\x00\x00b2c'
 #endif
 
   $ hg showcopies
@@ -221,13 +215,9 @@
 
 #else
   $ hg debugsidedata -c -v -- -1
-  3 sidedata entries
-   entry-0010 size 7
-    '0\x00a\n2\x00f'
-   entry-0011 size 3
-    '1\x00d'
-   entry-0012 size 5
-    '0\n1\n2'
+  1 sidedata entries
+   entry-0014 size 64
+    '\x00\x00\x00\x06\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x06\x00\x00\x00\x04\x00\x00\x00\x00\x07\x00\x00\x00\x05\x00\x00\x00\x01\x06\x00\x00\x00\x06\x00\x00\x00\x02adfghi'
 #endif
 
   $ hg showcopies
@@ -250,11 +240,9 @@
 #else
   $ hg ci -m 'copy a to j'
   $ hg debugsidedata -c -v -- -1
-  2 sidedata entries
-   entry-0010 size 3
-    '0\x00a'
-   entry-0012 size 1
-    '0'
+  1 sidedata entries
+   entry-0014 size 24
+    '\x00\x00\x00\x02\x00\x00\x00\x00\x01\x00\x00\x00\x00\x06\x00\x00\x00\x02\x00\x00\x00\x00aj'
 #endif
   $ hg debugdata j 0
   \x01 (esc)
@@ -281,11 +269,9 @@
   $ hg ci --amend -m 'copy a to j, v2'
   saved backup bundle to $TESTTMP/repo/.hg/strip-backup/*-*-amend.hg (glob)
   $ hg debugsidedata -c -v -- -1
-  2 sidedata entries
-   entry-0010 size 3
-    '0\x00a'
-   entry-0012 size 1
-    '0'
+  1 sidedata entries
+   entry-0014 size 24
+    '\x00\x00\x00\x02\x00\x00\x00\x00\x01\x00\x00\x00\x00\x06\x00\x00\x00\x02\x00\x00\x00\x00aj'
 #endif
   $ hg showcopies --config experimental.copies.read-from=filelog-only
   a -> j
@@ -304,6 +290,9 @@
 #else
   $ hg ci -m 'modify j'
   $ hg debugsidedata -c -v -- -1
+  1 sidedata entries
+   entry-0014 size 14
+    '\x00\x00\x00\x01\x14\x00\x00\x00\x01\x00\x00\x00\x00j'
 #endif
 
 Test writing only to filelog
@@ -318,11 +307,9 @@
 #else
   $ hg ci -m 'copy a to k'
   $ hg debugsidedata -c -v -- -1
-  2 sidedata entries
-   entry-0010 size 3
-    '0\x00a'
-   entry-0012 size 1
-    '0'
+  1 sidedata entries
+   entry-0014 size 24
+    '\x00\x00\x00\x02\x00\x00\x00\x00\x01\x00\x00\x00\x00\x06\x00\x00\x00\x02\x00\x00\x00\x00ak'
 #endif
 
   $ hg debugdata k 0
@@ -439,10 +426,10 @@
   compression-level: default default default
   $ hg debugsidedata -c -- 0
   1 sidedata entries
-   entry-0012 size 1
+   entry-0014 size 14
   $ hg debugsidedata -c -- 1
   1 sidedata entries
-   entry-0013 size 1
+   entry-0014 size 14
   $ hg debugsidedata -m -- 0
   $ cat << EOF > .hg/hgrc
   > [format]
@@ -463,7 +450,11 @@
   compression: zlib zlib zlib
   compression-level: default default default
   $ hg debugsidedata -c -- 0
+  1 sidedata entries
+   entry-0014 size 14
   $ hg debugsidedata -c -- 1
+  1 sidedata entries
+   entry-0014 size 14
   $ hg debugsidedata -m -- 0
 
 upgrading
@@ -487,10 +478,10 @@
   compression-level: default default default
   $ hg debugsidedata -c -- 0
   1 sidedata entries
-   entry-0012 size 1
+   entry-0014 size 14
   $ hg debugsidedata -c -- 1
   1 sidedata entries
-   entry-0013 size 1
+   entry-0014 size 14
   $ hg debugsidedata -m -- 0
 
 #endif
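The sizes in the updated test expectations follow from the layout: a 4-byte header, 9 bytes per index entry (1 flag byte plus two 4-byte integers), and the concatenated filenames. The sketch below recomputes them. `expected_size` is a hypothetical helper, and the filename lists are read off the trailing bytes of the expected values. The `entry-0014` label also appears to be the key `SD_FILES = 12` printed in octal; this is an inference from the old labels 0010 to 0013 lining up with keys 8 to 11, not something the diff states.

```python
import struct

HEADER_SIZE = struct.calcsize(">L")    # 4 bytes: entry count
ENTRY_SIZE = struct.calcsize(">bLL")   # 9 bytes: flag, filename end, copy source


def expected_size(filenames):
    """Size in bytes of a changed-files block holding the given filenames."""
    return HEADER_SIZE + ENTRY_SIZE * len(filenames) + sum(len(f) for f in filenames)


print(expected_size([b"a", b"b", b"c", b"d"]))  # 4 + 4 * 9 + 4 = 44
print(expected_size([b"b", b"b2"]))             # 4 + 2 * 9 + 3 = 25
print(expected_size([b"j"]))                    # 4 + 1 * 9 + 1 = 14
print("%04o" % 12)                              # 0014
```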