view tests/test-pathencode.py @ 39515:93486cc46125
treemanifest: introduce lazy loading of subdirs
An earlier patch series made it so that what to load was up to the calling code,
which works fine until manifests are copied - when they're copied, they're
loaded completely and thus we lose the entire benefit.
By lazy loading everything, we can avoid having to pass in the matcher to ~every
manifest function, and handle copies correctly as well. This changeset doesn't
go as far as it could with loading only the necessary subsets; that will happen
in later changes in this series. At the moment, except in a few situations, we
just load everything the moment we want to interact with treemanifest._dirs.
This is thus most likely to be a small slowdown whenever treemanifests are in
use, regardless of whether narrow is in use, but it should be easier to review
and to verify for correctness.
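The idea described above (defer loading subdirectories until something touches them, and let copies stay unloaded) can be sketched roughly as follows. This is only an illustration under stated assumptions: `LazyTree`, `addlazy`, `_loadall` and `makechild` are hypothetical names, not Mercurial's treemanifest API, and the real change may be structured quite differently.

```python
class LazyTree(object):
    """Hypothetical tree node; NOT Mercurial's treemanifest class."""

    def __init__(self):
        self._lazydirs = {}  # name -> zero-argument factory, not yet called
        self._dirs = {}      # name -> child node that has been loaded

    def addlazy(self, name, factory):
        # Record how to build the child without building it yet.
        self._lazydirs[name] = factory

    def _loadall(self):
        # Load every pending child the moment _dirs is really needed.
        for name, factory in self._lazydirs.items():
            self._dirs[name] = factory()
        self._lazydirs = {}

    def dirs(self):
        self._loadall()
        return self._dirs

    def copy(self):
        # Copying carries the pending factories over and loads nothing.
        new = LazyTree()
        new._lazydirs = dict(self._lazydirs)
        new._dirs = dict((n, c.copy()) for n, c in self._dirs.items())
        return new


calls = []  # records which children were actually loaded

def makechild(name):
    def factory():
        calls.append(name)
        return LazyTree()
    return factory

root = LazyTree()
root.addlazy('a', makechild('a'))
root.addlazy('b', makechild('b'))

dup = root.copy()
loaded_after_copy = list(calls)  # copy() alone should not load anything
names = sorted(root.dirs())      # first real access loads everything
```

In this sketch the copy keeps its own pending factories, so loading `root` does not force `dup` to load.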
This is part of a series of speedups. It is not expected to produce any real
speed improvements itself, but the numbers show that it doesn't produce a large
speed penalty in any common case, and in the cases where it does incur a
penalty, it is not a large absolute amount (even if it is a large percentage
amount).
Timing numbers according to command:
hyperfine --prepare <preparation_script> 'hg status'
HGRCPATH points to a file with the following contents:
[extensions]
narrow =
strip =
rebase =
mozilla-unified (called m-u below) was at revision #468856.
regular hash: eb39298e432d
treemanifests hash: 0553b7f29eaf
large-dir-repo (called l-d-r below) was generated with the following script:
#!/bin/bash
hg init large-dir-repo
mkdir -p large-dir-repo/third_party/rust/log
touch large-dir-repo/third_party/rust/log/foo.txt
for i in $(seq 1 30000); do
  d=$(mktemp -d large-dir-repo/third_party/XXXXXXXXX)
  touch $d/file.txt
done
hg -R large-dir-repo ci -Am 'rev0' --user test --date '0 0'
echo hi > large-dir-repo/third_party/rust/log/bar.txt
hg -R large-dir-repo ci -Am 'rev1' --user test --date '0 0'
echo hi > large-dir-repo/third_party/rust/log/baz.txt
hg -R large-dir-repo ci -Am 'rev2' --user test --date '0 0'
for the repos that use narrow, the narrowspec was this:
[include]
rootfilesin:accessible/jsat
rootfilesin:accessible/tests/mochitest/jsat
rootfilesin:mobile/android/chrome/content
rootfilesin:mobile/android/modules/geckoview
rootfilesin:third_party/rust/log
[exclude]
This narrowspec was chosen due to the size of the third_party/rust directory
(this directory was *not* modified in revision #468856 in mozilla-unified),
plus all the directories that *were* modified in revision #468856 of
mozilla-unified.
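As I understand Mercurial's pattern documentation, a `rootfilesin:` pattern selects only the files directly inside the named directory, not files in its subdirectories. The following is a minimal sketch of that rule for illustration; `rootfilesin_matches` is a hypothetical helper, not a Mercurial function, and it ignores everything else a real matcher does.

```python
import posixpath

def rootfilesin_matches(directory, path):
    """Return True if path is a file directly inside directory.

    Hypothetical helper: it models only the "directly in this directory"
    rule and assumes slash-separated, repo-relative paths.
    """
    return posixpath.dirname(path) == directory

direct = rootfilesin_matches('third_party/rust/log',
                             'third_party/rust/log/foo.txt')
nested = rootfilesin_matches('third_party/rust/log',
                             'third_party/rust/log/sub/foo.txt')
sibling = rootfilesin_matches('third_party/rust/log',
                              'third_party/rust/other.txt')
```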
Importantly, when using narrow, these repos had everything checked out (in the
case of large-dir-repo, that means all 30,001 directories), *before* adding the
narrowspec. This is to simulate the behavior when using a virtual filesystem
that shows everything for the user even if they haven't added it to the
narrowspec yet. This is not a supported configuration, and `hg update` and `hg
rebase` will not really do the "correct" thing if there are mutations outside
of the narrowspec (which is not the case in these tests, due to a carefully
crafted narrowspec), but non-mutating commands should behave correctly.
I'm not claiming anything less than a 5% speed win as improvements due to this
change; these are probably either measurement artifacts or constant time
improvements. The numbers that aren't changing are shown primarily to prove that
this doesn't make anything worse in any case I plan on testing during this
series.
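For reference, the "% of before" column in the tables below is consistent with dividing the 'after' mean by the 'before' mean. A small sketch of that arithmetic and of the 5% rule stated above follows; the function names are hypothetical, and both means are assumed to be in the same unit.

```python
def percent_of_before(before, after):
    """Return the 'after' mean as a percentage of the 'before' mean."""
    return 100.0 * after / before

def is_claimed_win(before, after, threshold=5.0):
    """True only when the speedup is at least `threshold` percent."""
    return percent_of_before(before, after) <= 100.0 - threshold

# m-u with narrow, 'diff -c . --git': 422.0 ms before, 351.2 ms after.
pct = round(percent_of_before(422.0, 351.2), 1)
claimed = is_claimed_win(422.0, 351.2)

# l-d-r with narrow and treemanifest, rebase: 6.228 s before, 5.924 s after.
# This is just under a 5% win, so it is not claimed.
not_claimed = is_claimed_win(6.228, 5.924)
```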
'before' is hg from commit 6268fed3
'N' indicates narrow in use
'T' indicates treemanifest in use
Please note that these commands and the narrowspec are a little different than
the ones in a similar table that I made in a3cabe9415e1.
Important: it is my understanding that the numbers below are *not super
reliable*; the large slowdowns may be artifacts of some odd interaction between
GC and Python module/code complexity. Another changeset of mine (D4351) had shown large
timing differences when ~empty, uncalled functions were added to match.py,
though only when using --color=never or redirecting to /dev/null. We seem to be
on some cusp of complexity or code size that is causing, at my best guess
(according to linux `perf` benchmarks) GC to alter behavior and cause a
200-400ms difference in timings. I haven't had a chance to replicate these
results on another machine.
diff --git:
repo  | N | T | before (mean +- stdev) | after (mean +- stdev) | % of before
------+---+---+------------------------+-----------------------+------------
m-u   |   |   | 1.580 s +- 0.034 s     | 1.576 s +- 0.022 s    |  99.7%
m-u   |   | x | 1.568 s +- 0.025 s     | 1.584 s +- 0.044 s    | 101.0%
m-u   | x |   | 1.569 s +- 0.031 s     | 1.554 s +- 0.025 s    |  99.0%
m-u   | x | x | 107.3 ms +- 1.6 ms     | 106.3 ms +- 1.5 ms    |  99.1%
l-d-r |   |   | 232.5 ms +- 5.9 ms     | 233.5 ms +- 5.3 ms    | 100.4%
l-d-r |   | x | 236.6 ms +- 6.3 ms     | 233.6 ms +- 7.0 ms    |  98.7%
l-d-r | x |   | 118.4 ms +- 2.1 ms     | 118.4 ms +- 1.4 ms    | 100.0%
l-d-r | x | x | 116.8 ms +- 1.5 ms     | 118.9 ms +- 1.6 ms    | 101.8%
diff -c . --git:
repo  | N | T | before (mean +- stdev) | after (mean +- stdev) | % of before
------+---+---+------------------------+-----------------------+------------
m-u   |   |   | 354.4 ms +- 16.6 ms    | 351.0 ms +- 6.9 ms    |  99.0%
m-u   |   | x | 207.2 ms +- 3.0 ms     | 206.2 ms +- 2.7 ms    |  99.5%
m-u   | x |   | 422.0 ms +- 26.0 ms    | 351.2 ms +- 6.4 ms    |  83.2% <--
m-u   | x | x | 166.7 ms +- 2.1 ms     | 169.5 ms +- 4.1 ms    | 101.7%
l-d-r |   |   | 98.4 ms +- 4.5 ms      | 98.5 ms +- 2.1 ms     | 100.1%
l-d-r |   | x | 5.519 s +- 0.060 s     | 5.149 s +- 0.042 s    |  93.3% <--
l-d-r | x |   | 99.1 ms +- 3.2 ms      | 102.6 ms +- 9.7 ms    | 103.5% <--?
l-d-r | x | x | 994.9 ms +- 10.7 ms    | 1.026 s +- 0.012 s    | 103.1% <--?
rebase -r . --keep -d .^^:
repo  | N | T | before (mean +- stdev) | after (mean +- stdev) | % of before
------+---+---+------------------------+-----------------------+------------
m-u   |   |   | 6.639 s +- 0.168 s     | 6.559 s +- 0.097 s    |  98.8%
m-u   |   | x | 6.601 s +- 0.143 s     | 6.640 s +- 0.207 s    | 100.6%
m-u   | x |   | 6.582 s +- 0.098 s     | 6.543 s +- 0.098 s    |  99.4%
m-u   | x | x | 678.4 ms +- 57.7 ms    | 703.7 ms +- 52.4 ms   | 103.7% <--?
l-d-r |   |   | 780.0 ms +- 23.9 ms    | 776.0 ms +- 12.6 ms   |  99.5%
l-d-r |   | x | 7.520 s +- 0.255 s     | 7.395 s +- 0.044 s    |  98.3%
l-d-r | x |   | 331.9 ms +- 16.5 ms    | 327.0 ms +- 3.4 ms    |  98.5%
l-d-r | x | x | 6.228 s +- 0.113 s     | 5.924 s +- 0.044 s    |  95.1%
status --change . --copies:
repo  | N | T | before (mean +- stdev) | after (mean +- stdev) | % of before
------+---+---+------------------------+-----------------------+------------
m-u   |   |   | 330.8 ms +- 7.2 ms     | 329.0 ms +- 7.1 ms    |  99.5%
m-u   |   | x | 182.9 ms +- 2.7 ms     | 183.5 ms +- 2.7 ms    | 100.3%
m-u   | x |   | 330.0 ms +- 7.6 ms     | 327.1 ms +- 5.4 ms    |  99.1%
m-u   | x | x | 146.2 ms +- 2.4 ms     | 147.1 ms +- 1.3 ms    | 100.6%
l-d-r |   |   | 95.3 ms +- 1.4 ms      | 95.9 ms +- 1.5 ms     | 100.6%
l-d-r |   | x | 5.157 s +- 0.035 s     | 5.166 s +- 0.058 s    | 100.2%
l-d-r | x |   | 99.7 ms +- 3.0 ms      | 100.2 ms +- 4.4 ms    | 100.5%
l-d-r | x | x | 993.6 ms +- 13.1 ms    | 1.025 s +- 0.015 s    | 103.2% <--?
status --copies:
repo  | N | T | before (mean +- stdev) | after (mean +- stdev) | % of before
------+---+---+------------------------+-----------------------+------------
m-u   |   |   | 2.348 s +- 0.031 s     | 2.329 s +- 0.019 s    |  99.2%
m-u   |   | x | 2.337 s +- 0.026 s     | 2.346 s +- 0.034 s    | 100.4%
m-u   | x |   | 2.354 s +- 0.015 s     | 2.342 s +- 0.021 s    |  99.5%
m-u   | x | x | 120.6 ms +- 4.3 ms     | 119.2 ms +- 2.1 ms    |  98.8%
l-d-r |   |   | 731.5 ms +- 11.1 ms    | 719.6 ms +- 9.8 ms    |  98.4%
l-d-r |   | x | 729.0 ms +- 15.5 ms    | 725.7 ms +- 10.6 ms   |  99.5%
l-d-r | x |   | 211.0 ms +- 3.9 ms     | 212.8 ms +- 3.7 ms    | 100.9%
l-d-r | x | x | 211.5 ms +- 4.2 ms     | 211.0 ms +- 3.3 ms    |  99.8%
update $rev^; ~/src/hg/hg{hg}/hg update $rev:
repo  | N | T | before (mean +- stdev) | after (mean +- stdev) | % of before
------+---+---+------------------------+-----------------------+------------
m-u   |   |   | 3.910 s +- 0.055 s     | 3.920 s +- 0.075 s    | 100.3%
m-u   |   | x | 3.613 s +- 0.056 s     | 3.630 s +- 0.056 s    | 100.5%
m-u   | x |   | 3.873 s +- 0.055 s     | 3.864 s +- 0.049 s    |  99.8%
m-u   | x | x | 400.4 ms +- 7.4 ms     | 403.6 ms +- 5.0 ms    | 100.8%
l-d-r |   |   | 531.6 ms +- 10.0 ms    | 528.8 ms +- 9.6 ms    |  99.5%
l-d-r |   | x | 10.377 s +- 0.049 s    | 9.955 s +- 0.046 s    |  95.9%
l-d-r | x |   | 308.3 ms +- 4.4 ms     | 306.8 ms +- 3.7 ms    |  99.5%
l-d-r | x | x | 1.805 s +- 0.015 s     | 1.834 s +- 0.020 s    | 101.6%
Differential Revision: https://phab.mercurial-scm.org/D4366
author | spectral <spectral@google.com> |
---|---|
date | Thu, 16 Aug 2018 12:31:52 -0700 |
parents | 1b230e19d044 |
children | 2372284d9457 |
# This is a randomized test that generates different pathnames every
# time it is invoked, and tests the encoding of those pathnames.
#
# It uses a simple probabilistic model to generate valid pathnames
# that have proven likely to expose bugs and divergent behavior in
# different encoding implementations.

from __future__ import absolute_import, print_function

import binascii
import collections
import itertools
import math
import os
import random
import sys
import time
from mercurial import (
    pycompat,
    store,
)

try:
    xrange
except NameError:
    xrange = range

validchars = set(map(pycompat.bytechr, range(0, 256)))
alphanum = range(ord('A'), ord('Z'))

for c in (b'\0', b'/'):
    validchars.remove(c)

winreserved = (b'aux con prn nul'.split() +
               [b'com%d' % i for i in xrange(1, 10)] +
               [b'lpt%d' % i for i in xrange(1, 10)])

def casecombinations(names):
    '''Build all case-diddled combinations of names.'''

    combos = set()

    for r in names:
        for i in xrange(len(r) + 1):
            for c in itertools.combinations(xrange(len(r)), i):
                d = r
                for j in c:
                    d = b''.join((d[:j], d[j:j + 1].upper(), d[j + 1:]))
                combos.add(d)
    return sorted(combos)

def buildprobtable(fp, cmd='hg manifest tip'):
    '''Construct and print a table of probabilities for path name
    components. The numbers are percentages.'''

    counts = collections.defaultdict(lambda: 0)
    for line in os.popen(cmd).read().splitlines():
        if line[-2:] in ('.i', '.d'):
            line = line[:-2]
        if line.startswith('data/'):
            line = line[5:]
        for c in line:
            counts[c] += 1
    for c in '\r/\n':
        counts.pop(c, None)
    t = sum(counts.itervalues()) / 100.0
    fp.write('probtable = (')
    for i, (k, v) in enumerate(sorted(counts.items(), key=lambda x: x[1],
                                      reverse=True)):
        if (i % 5) == 0:
            fp.write('\n ')
        vt = v / t
        if vt < 0.0005:
            break
        fp.write('(%r, %.03f), ' % (k, vt))
    fp.write('\n )\n')

# A table of character frequencies (as percentages), gleaned by
# looking at filelog names from a real-world, very large repo.

probtable = (
    (b't', 9.828), (b'e', 9.042), (b's', 8.011), (b'a', 6.801), (b'i', 6.618),
    (b'g', 5.053), (b'r', 5.030), (b'o', 4.887), (b'p', 4.363), (b'n', 4.258),
    (b'l', 3.830), (b'h', 3.693), (b'_', 3.659), (b'.', 3.377), (b'm', 3.194),
    (b'u', 2.364), (b'd', 2.296), (b'c', 2.163), (b'b', 1.739), (b'f', 1.625),
    (b'6', 0.666), (b'j', 0.610), (b'y', 0.554), (b'x', 0.487), (b'w', 0.477),
    (b'k', 0.476), (b'v', 0.473), (b'3', 0.336), (b'1', 0.335), (b'2', 0.326),
    (b'4', 0.310), (b'5', 0.305), (b'9', 0.302), (b'8', 0.300), (b'7', 0.299),
    (b'q', 0.298), (b'0', 0.250), (b'z', 0.223), (b'-', 0.118), (b'C', 0.095),
    (b'T', 0.087), (b'F', 0.085), (b'B', 0.077), (b'S', 0.076), (b'P', 0.076),
    (b'L', 0.059), (b'A', 0.058), (b'N', 0.051), (b'D', 0.049), (b'M', 0.046),
    (b'E', 0.039), (b'I', 0.035), (b'R', 0.035), (b'G', 0.028), (b'U', 0.026),
    (b'W', 0.025), (b'O', 0.017), (b'V', 0.015), (b'H', 0.013), (b'Q', 0.011),
    (b'J', 0.007), (b'K', 0.005), (b'+', 0.004), (b'X', 0.003), (b'Y', 0.001),
    )

for c, _ in probtable:
    validchars.remove(c)
validchars = list(validchars)

def pickfrom(rng, table):
    c = 0
    r = rng.random() * sum(i[1] for i in table)
    for i, p in table:
        c += p
        if c >= r:
            return i

reservedcombos = casecombinations(winreserved)

# The first component of a name following a slash.

firsttable = (
    (lambda rng: pickfrom(rng, probtable), 90),
    (lambda rng: rng.choice(validchars), 5),
    (lambda rng: rng.choice(reservedcombos), 5),
    )

# Components of a name following the first.

resttable = firsttable[:-1]

# Special suffixes.

internalsuffixcombos = casecombinations(b'.hg .i .d'.split())

# The last component of a path, before a slash or at the end of a name.

lasttable = resttable + (
    (lambda rng: b'', 95),
    (lambda rng: rng.choice(internalsuffixcombos), 5),
    )

def makepart(rng, k):
    '''Construct a part of a pathname, without slashes.'''

    p = pickfrom(rng, firsttable)(rng)
    l = len(p)
    ps = [p]
    maxl = rng.randint(1, k)
    while l < maxl:
        p = pickfrom(rng, resttable)(rng)
        l += len(p)
        ps.append(p)
    ps.append(pickfrom(rng, lasttable)(rng))
    return b''.join(ps)

def makepath(rng, j, k):
    '''Construct a complete pathname.'''

    return (b'data/' + b'/'.join(makepart(rng, k) for _ in xrange(j)) +
            rng.choice([b'.d', b'.i']))

def genpath(rng, count):
    '''Generate random pathnames with gradually increasing lengths.'''

    mink, maxk = 1, 4096
    def steps():
        for i in xrange(count):
            yield mink + int(round(math.sqrt((maxk - mink) * float(i) / count)))
    for k in steps():
        x = rng.randint(1, k)
        y = rng.randint(1, k)
        yield makepath(rng, x, y)

def runtests(rng, seed, count):
    nerrs = 0
    for p in genpath(rng, count):
        h = store._pathencode(p) # uses C implementation, if available
        r = store._hybridencode(p, True) # reference implementation in Python
        if h != r:
            if nerrs == 0:
                print('seed:', hex(seed)[:-1], file=sys.stderr)
            print("\np: '%s'" % p.encode("string_escape"), file=sys.stderr)
            print("h: '%s'" % h.encode("string_escape"), file=sys.stderr)
            print("r: '%s'" % r.encode("string_escape"), file=sys.stderr)
            nerrs += 1
    return nerrs

def main():
    import getopt

    # Empirically observed to take about a second to run
    count = 100
    seed = None
    opts, args = getopt.getopt(sys.argv[1:], 'c:s:',
                               ['build', 'count=', 'seed='])
    for o, a in opts:
        if o in ('-c', '--count'):
            count = int(a)
        elif o in ('-s', '--seed'):
            seed = int(a, base=0) # accepts base 10 or 16 strings
        elif o == '--build':
            buildprobtable(sys.stdout,
                           'find .hg/store/data -type f && '
                           'cat .hg/store/fncache 2>/dev/null')
            sys.exit(0)

    if seed is None:
        try:
            seed = int(binascii.hexlify(os.urandom(16)), 16)
        except AttributeError:
            seed = int(time.time() * 1000)

    rng = random.Random(seed)
    if runtests(rng, seed, count):
        sys.exit(1)

if __name__ == '__main__':
    main()
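The weighted selection that pickfrom performs in the test can be tried in isolation. The sketch below repeats the same cumulative-weight logic with descriptive variable names and no Mercurial imports; `FixedRng` and the three-entry `table` are hypothetical stand-ins added for illustration and are not part of the test file.

```python
import random

def pickfrom(rng, table):
    """Return an item from table, chosen in proportion to its weight."""
    c = 0
    r = rng.random() * sum(weight for _, weight in table)
    for item, weight in table:
        c += weight
        if c >= r:
            return item

class FixedRng(object):
    """Hypothetical stand-in for random.Random with a preset random()."""

    def __init__(self, value):
        self._value = value

    def random(self):
        return self._value

# Weights mirror the 90/5/5 split used by firsttable in the test.
table = ((b'common', 90), (b'rare', 5), (b'reserved', 5))

# Fixed inputs show where the cumulative weight crosses each boundary.
picks = [pickfrom(FixedRng(v), table) for v in (0.0, 0.5, 0.92, 0.99)]

# With a seeded generator, most picks land on the heavily weighted entry.
rng = random.Random(0x1234)
sample = [pickfrom(rng, table) for _ in range(1000)]
```

Passing the generator in explicitly, as the test does, is what makes a failing run reproducible from its printed seed.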