Fri, 13 Jan 2017 19:58:00 -0800 revlog: use compression engine APIs for decompression
Gregory Szorc <gregory.szorc@gmail.com> [Fri, 13 Jan 2017 19:58:00 -0800] rev 30817
revlog: use compression engine APIs for decompression Now that compression engines declare their header in revlog chunks and can decompress revlog chunks, we refactor revlog.decompress() to use them. Making full use of the property that revlog compressor objects are reusable, revlog instances now maintain a dict mapping an engine's revlog header to a compressor object. This is not only a performance optimization for engines where compressor object reuse can result in better performance, but it also serves as a cache of header values so we don't need to perform redundant lookups against the compression engine manager. (Yes, I measured and the overhead of a function call versus a dict lookup was observed.) Replacing the previous inline lookup table with a dict lookup was measured to make chunk reading ~2.5% slower on changelogs and ~4.5% slower on manifests. So, the inline lookup table has been mostly preserved so we don't lose performance. This is unfortunate. But many decompression operations complete in microseconds, so Python attribute lookup, dict lookup, and function calls do matter. The impact of this change on mozilla-unified is as follows: $ hg perfrevlogchunks -c ! chunk ! wall 1.953663 comb 1.950000 user 1.920000 sys 0.030000 (best of 6) ! wall 1.946000 comb 1.940000 user 1.910000 sys 0.030000 (best of 6) ! chunk batch ! wall 1.791075 comb 1.800000 user 1.760000 sys 0.040000 (best of 6) ! wall 1.785690 comb 1.770000 user 1.750000 sys 0.020000 (best of 6) $ hg perfrevlogchunks -m ! chunk ! wall 2.587262 comb 2.580000 user 2.550000 sys 0.030000 (best of 4) ! wall 2.616330 comb 2.610000 user 2.560000 sys 0.050000 (best of 4) ! chunk batch ! wall 2.427092 comb 2.420000 user 2.400000 sys 0.020000 (best of 5) ! wall 2.462061 comb 2.460000 user 2.400000 sys 0.060000 (best of 4) Changelog chunk reading is slightly faster but manifest reading is slower. What gives? On this repo, 99.85% of changelog entries are zlib compressed (the 'x' header). On the manifest, 67.5% are zlib and 32.4% are '\0'. This patch swapped the test order of 'x' and '\0' so now 'x' is tested first. This makes changelogs faster since they almost always hit the first branch. This makes a significant percentage of manifest '\0' chunks slower because that code path now performs an extra test. Yes, I too can't believe we're able to measure the impact of an if..elif with simple string compares. I reckon this code would benefit from being written in C...
Fri, 13 Jan 2017 10:22:25 +0100 hgweb: build the "entries" list directly in filelog command
Denis Laxalde <denis.laxalde@logilab.fr> [Fri, 13 Jan 2017 10:22:25 +0100] rev 30816
hgweb: build the "entries" list directly in filelog command There's no apparent reason to have this "entries" generator function that builds a list and then yields its elements in reverse order and which is only called to build the "entries" list. So just build the list directly, in reverse order. Adjust "parity" generator's offset to keep rendering the same.
Sat, 14 Jan 2017 10:11:19 -0800 convert: remove "replacecommitter" action
Gregory Szorc <gregory.szorc@gmail.com> [Sat, 14 Jan 2017 10:11:19 -0800] rev 30815
convert: remove "replacecommitter" action As pointed out by Yuya, this action doesn't add much (any?) value.
Sat, 14 Jan 2017 20:31:35 +0900 ui: check EOF of getpass() response read from command-server channel
Yuya Nishihara <yuya@tcha.org> [Sat, 14 Jan 2017 20:31:35 +0900] rev 30814
ui: check EOF of getpass() response read from command-server channel readline() returns '' only when EOF is encountered, in which case, Python's getpass() raises EOFError. We should do the same to abort the session as "response expected." This bug was reported to https://bitbucket.org/tortoisehg/thg/issues/4659/
Fri, 13 Jan 2017 23:21:10 -0800 convert: config option to control Git committer actions
Gregory Szorc <gregory.szorc@gmail.com> [Fri, 13 Jan 2017 23:21:10 -0800] rev 30813
convert: config option to control Git committer actions When converting a Git repository to Mercurial at Mozilla, I encountered a scenario where I didn't want `hg convert` to automatically add the "committer: <committer>" line to commit messages. While I can hack around this by rewriting the Git commit before it is fed into `hg convert`, I figured it would be a useful knob to control. This patch introduces a config option that allows lots of control over the committer value. I initially implemented this as a single boolean flag to control whether to save the committer message. But then there was feedback that it would be useful to save the committer in extra data. While this patch doesn't implement support for saving in extra data, it does add a mechanism for extending which actions to take on the committer field. We should be able to easily add actions to save in extra data. Some of the implemented features weren't asked for. But I figured they could be useful. If nothing else they demonstrate the extensibility of this mechanism.
Fri, 13 Jan 2017 21:21:02 -0800 help: make "mergetool" an alias for "merge-tools"
Gregory Szorc <gregory.szorc@gmail.com> [Fri, 13 Jan 2017 21:21:02 -0800] rev 30812
help: make "mergetool" an alias for "merge-tools" I've probably typed `hg help mergetool` dozens of times. I'm tired of it not working.
Thu, 12 Jan 2017 21:06:55 +0900 templatekw: force noprefix=False to insure diffstat consistency (issue4755)
Matthieu Laneuville <mlaneuville@protonmail.com> [Thu, 12 Jan 2017 21:06:55 +0900] rev 30811
templatekw: force noprefix=False to insure diffstat consistency (issue4755) The result of diffstatdata should not depend on having noprefix set or not, as was reported in issue 4755. Forcing noprefix to false on call makes sure the parser receives the diff in the correct format and returns the proper result. Another way to fix this would have been to change the regular expressions in path.diffstatdata(), but that would have introduced many unecessary special cases.
Fri, 13 Jan 2017 10:11:37 -0800 check-code: reject module-level @cachefunc
Martin von Zweigbergk <martinvonz@google.com> [Fri, 13 Jan 2017 10:11:37 -0800] rev 30810
check-code: reject module-level @cachefunc Module-level @cachefunc usage is risky because it can easily create a memory "leak". Let's reject it completely for now. If a valid usage comes up in the future, we can always improve the check or reconsider.
Fri, 13 Jan 2017 11:42:36 -0800 similar: remove caching from the module level
Pierre-Yves David <pierre-yves.david@ens-lyon.org> [Fri, 13 Jan 2017 11:42:36 -0800] rev 30809
similar: remove caching from the module level To prevent Bad Thingsā„¢ from happening, let's rework the logic to not use util.cachefunc.
Mon, 09 Jan 2017 11:01:45 -0800 patch: add label for coloring the similarity extended header
Sean Farley <sean@farley.io> [Mon, 09 Jan 2017 11:01:45 -0800] rev 30808
patch: add label for coloring the similarity extended header Just like the summary says, this will colorize the: similarity index 88% line in the diff output.
Mon, 09 Jan 2017 11:24:18 -0800 patch: use opt.showsimilarity to calculate and show the similarity
Sean Farley <sean@farley.io> [Mon, 09 Jan 2017 11:24:18 -0800] rev 30807
patch: use opt.showsimilarity to calculate and show the similarity Tests have been added.
Mon, 09 Jan 2017 10:51:44 -0800 patch: add similarity config knob in experimental section
Sean Farley <sean@farley.io> [Mon, 09 Jan 2017 10:51:44 -0800] rev 30806
patch: add similarity config knob in experimental section This config knob will control whether or not to show the similarity calculation in the diff output: diff --git a/README.md b/foo.md similarity index 88% rename from README.md rename to foo.md --- a/README.md +++ b/foo.md
Sat, 07 Jan 2017 20:47:57 -0800 similar: move score function to module level
Sean Farley <sean@farley.io> [Sat, 07 Jan 2017 20:47:57 -0800] rev 30805
similar: move score function to module level Future patches will use this to report the similarity of a rename / copy in the patch output.
Mon, 09 Jan 2017 17:58:19 +0900 revset: abuse x:y syntax to specify line range of followlines()
Yuya Nishihara <yuya@tcha.org> [Mon, 09 Jan 2017 17:58:19 +0900] rev 30804
revset: abuse x:y syntax to specify line range of followlines() This slightly complicates the parsing (see the previous patch), but the overall result seems not bad. I keep x:, :y and : for future extension.
Mon, 09 Jan 2017 16:55:56 +0900 revset: do not transform range* operators in parsed tree
Yuya Nishihara <yuya@tcha.org> [Mon, 09 Jan 2017 16:55:56 +0900] rev 30803
revset: do not transform range* operators in parsed tree This allows us to handle x:y range as a general range object. A primary user of it is followlines().
Mon, 09 Jan 2017 17:45:11 +0900 revset: add default value to getinteger() helper
Yuya Nishihara <yuya@tcha.org> [Mon, 09 Jan 2017 17:45:11 +0900] rev 30802
revset: add default value to getinteger() helper This seems handy.
Mon, 09 Jan 2017 17:39:44 +0900 revset: factor out getinteger() helper
Yuya Nishihara <yuya@tcha.org> [Mon, 09 Jan 2017 17:39:44 +0900] rev 30801
revset: factor out getinteger() helper We have 4 revset functions that take integer arguments, and they handle their arguments in slightly different ways. This patch unifies them: - getstring() in place of getsymbol(), which is more consistent with the handling of integer revisions (both 1 and '1' are valid) - say "expects" instead of "requires" for type errors We don't need to catch TypeError since getstring() must return a string.
Mon, 09 Jan 2017 16:16:26 +0900 revset: rename rev argument of followlines() to startrev
Yuya Nishihara <yuya@tcha.org> [Mon, 09 Jan 2017 16:16:26 +0900] rev 30800
revset: rename rev argument of followlines() to startrev The rev argument has the same meaning as startrev of follow(), and I think startrev is more informative. followlines() is new function, we can make BC now.
Fri, 13 Jan 2017 23:48:21 +0900 help: use :hg: role and canonical name to point to revset string patterns
Yuya Nishihara <yuya@tcha.org> [Fri, 13 Jan 2017 23:48:21 +0900] rev 30799
help: use :hg: role and canonical name to point to revset string patterns Follows up 5dd67f0993ce. Now revisions.txt and revsets.txt has been merged, so use revisions.* as a pointer.
Mon, 02 Jan 2017 13:27:20 -0800 util: compression APIs to support revlog decompression
Gregory Szorc <gregory.szorc@gmail.com> [Mon, 02 Jan 2017 13:27:20 -0800] rev 30798
util: compression APIs to support revlog decompression Previously, compression engines had APIs for performing revlog compression but no mechanism to perform revlog decompression. This patch changes that. Revlog decompression is slightly more complicated than compression because in the compression case there is (currently) only a single engine that can be used at a time. However for decompression, a revlog could contain chunks from multiple compression engines. This means decompression needs to map to multiple engines and decompressors. This functionality is outside the scope of this patch. But it drives the decision for engines to declare a byte header sequence that identifies revlog data as belonging to an engine and an API for obtaining an engine from a revlog header.
Sun, 08 Jan 2017 10:08:29 +0800 crecord: add an experimental option for space key to move cursor down
Anton Shestakov <av6@dwimlabs.net> [Sun, 08 Jan 2017 10:08:29 +0800] rev 30797
crecord: add an experimental option for space key to move cursor down I really want to have an option of toggling a selection on a line and also moving cursor down as a single keystroke. It also kinda makes sense for space key to do this, because some other curses UIs in the wild do this (e.g. various file managers, htop). So I got an idea to make a config option that defaults to False for compatibility, but allows making crecord UI a lot more useful for people with big hunks. We add this an experimental option to experiment with this behavior.
Mon, 02 Jan 2017 12:02:08 -0800 perf: support multiple compression engines in perfrevlogchunks
Gregory Szorc <gregory.szorc@gmail.com> [Mon, 02 Jan 2017 12:02:08 -0800] rev 30796
perf: support multiple compression engines in perfrevlogchunks Now that the revlog has a reference to a compressor, it is possible to swap in other compression engines. So, teach `hg perfrevlogchunks` to do that. The default behavior of `hg perfrevlogchunks` is now to measure the compression performance of all compression engines implementing the revlog compressor API. This effectively adds the no-op "none" compressor and zstd (when available) into the default set. While we can't yet plug alternate compressors into revlogs, this command gives us a preview of the performance. On the mozilla-unified repository: $ hg perfrevlogchunks -c ! compress w/ none ! wall 0.115159 comb 0.110000 user 0.110000 sys 0.000000 (best of 86) ! compress w/ zlib ! wall 5.681406 comb 5.680000 user 5.680000 sys 0.000000 (best of 3) ! compress w/ zstd ! wall 2.624781 comb 2.620000 user 2.620000 sys 0.000000 (best of 4) $ hg perfrevlogchunks -m ! compress w/ none ! wall 0.124486 comb 0.120000 user 0.120000 sys 0.000000 (best of 79) ! compress w/ zlib ! wall 10.144701 comb 10.150000 user 10.150000 sys 0.000000 (best of 3) ! compress w/ zstd ! wall 4.383118 comb 4.390000 user 4.390000 sys 0.000000 (best of 3) Those numbers for zstd look promising. But they aren't the full story. For that, we'll need to look at decompression times and storage sizes. Stay tuned...
Mon, 02 Jan 2017 11:22:52 -0800 revlog: use compression engine API for compression
Gregory Szorc <gregory.szorc@gmail.com> [Mon, 02 Jan 2017 11:22:52 -0800] rev 30795
revlog: use compression engine API for compression This commit swaps in the just-added revlog compressor API into the revlog class. Instead of implementing zlib compression inline in compress(), we now store a cached-on-first-use revlog compressor on each revlog instance and invoke its "compress()" method. As part of this, revlog.compress() has been refactored a bit to use a cleaner code flow and modern formatting (e.g. avoiding parenthesis around returned tuples). On a mozilla-unified repo, here are the "compress" times for a few commands: $ hg perfrevlogchunks -c ! wall 5.772450 comb 5.780000 user 5.780000 sys 0.000000 (best of 3) ! wall 5.795158 comb 5.790000 user 5.790000 sys 0.000000 (best of 3) $ hg perfrevlogchunks -m ! wall 9.975789 comb 9.970000 user 9.970000 sys 0.000000 (best of 3) ! wall 10.019505 comb 10.010000 user 10.010000 sys 0.000000 (best of 3) Compression times did seem to slow down just a little. There are 360,210 changelog revisions and 359,342 manifest revisions. For the changelog, mean time to compress a revision increased from ~16.025us to ~16.088us. That's basically a function call or an attribute lookup. I suppose this is the price you pay for abstraction. It's so low that I'm not concerned.
Mon, 02 Jan 2017 12:39:03 -0800 util: compression APIs to support revlog compression
Gregory Szorc <gregory.szorc@gmail.com> [Mon, 02 Jan 2017 12:39:03 -0800] rev 30794
util: compression APIs to support revlog compression As part of "zstd all of the things," we need to teach revlogs to use non-zlib compression formats. Because we're routing all compression via the "compression manager" and "compression engine" APIs, we need to introduction functionality there for performing revlog operations. Ideally, revlog compression and decompression operations would be implemented in terms of simple "compress" and "decompress" primitives. However, there are a few considerations that make us want to have a specialized primitive for handling revlogs: 1) Performance. Revlogs tend to do compression and especially decompression operations in batches. Any overhead for e.g. instantiating a "context" for performing an operation can be noticed. For this reason, our "revlog compressor" primitive is reusable. For zstd, we reuse the same compression "context" for multiple operations. I've measured this to have a performance impact versus constructing new contexts for each operation. 2) Specialization. By having a primitive dedicated to revlog use, we can make revlog-specific choices and leave the door open for more functionality in the future. For example, the zstd revlog compressor may one day make use of dictionary compression. A future patch will introduce a decompress() on the compressor object. The code for the zlib compressor is basically copied from revlog.compress(). Although it doesn't handle the empty input case, the null first byte case, and the 'u' prefix case. These cases will continue to be handled in revlog.py once that code is ported to use this API.
Mon, 02 Jan 2017 13:00:16 -0800 revlog: move decompress() from module to revlog class (API)
Gregory Szorc <gregory.szorc@gmail.com> [Mon, 02 Jan 2017 13:00:16 -0800] rev 30793
revlog: move decompress() from module to revlog class (API) Upcoming patches will convert revlogs to use the compression engine APIs to perform all things compression. The yet-to-be-introduced APIs support a persistent "compressor" object so the same object can be reused for multiple compression operations, leading to better performance. In addition, compression engines like zstd may wish to tweak compression engine state based on the revlog (e.g. per-revlog compression dictionaries). A global and shared decompress() function will shortly no longer make much sense. So, we move decompress() to be a method of the revlog class. It joins compress() there. On the mozilla-unified repo, we can measure the impact of this change on reading performance: $ hg perfrevlogchunks -c ! chunk ! wall 1.932573 comb 1.930000 user 1.900000 sys 0.030000 (best of 6) ! wall 1.955183 comb 1.960000 user 1.930000 sys 0.030000 (best of 6) ! chunk batch ! wall 1.787879 comb 1.780000 user 1.770000 sys 0.010000 (best of 6 ! wall 1.774444 comb 1.770000 user 1.750000 sys 0.020000 (best of 6) "chunk" appeared to become slower but "chunk batch" got faster. Upon further examination by running both sets multiple times, the numbers appear to converge across all runs. This tells me that there is no perceived performance impact to this refactor.
Mon, 02 Jan 2017 11:50:17 -0800 revlog: make compressed size comparisons consistent
Gregory Szorc <gregory.szorc@gmail.com> [Mon, 02 Jan 2017 11:50:17 -0800] rev 30792
revlog: make compressed size comparisons consistent revlog.compress() compares the compressed size to the input size and throws away the compressed data if it is larger than the input. This is the correct thing to do, as storing compressed data that is larger than the input takes up more storage space and makes reading slower. However, the comparison was implemented inconsistently. For the streaming compression mode, we threw away the result if it was greater than or equal to the input size. But for the one-shot compression, we threw away the compression only if it was greater than the input size! This patch changes the comparison for the simple case so it is consistent with the streaming case. As a few tests demonstrate, this adds 1 byte to some revlog entries. This is because of an added 'u' header on the chunk. It seems somewhat wrong to increase the revlog size here. However, IMO the cost of 1 byte in storage is insignificant compared to the performance gains of avoiding decompression. This patch should invite questions around the heuristic for throwing away compressed data. For example, I'd argue we should be more liberal about rejecting compressed data, additionally doing so where the number of bytes saved fails to reach a threshold. But we can have this discussion another time.
Sat, 07 Jan 2017 20:43:49 -0800 similar: rename local variable to not collide with previous
Sean Farley <sean@farley.io> [Sat, 07 Jan 2017 20:43:49 -0800] rev 30791
similar: rename local variable to not collide with previous Future patches will move the score function to the module level, so let's not shadow that.
Mon, 09 Jan 2017 10:59:45 -0800 patch: add label for coloring the index extended header
Sean Farley <sean@farley.io> [Mon, 09 Jan 2017 10:59:45 -0800] rev 30790
patch: add label for coloring the index extended header Just like the summary says, this will colorize the: index 3d3ba4b65e11..57274a0f46b2 100644 line in the diff output.
Sat, 31 Dec 2016 15:41:57 -0600 patch: add index line for diff output
Sean Farley <sean@farley.io> [Sat, 31 Dec 2016 15:41:57 -0600] rev 30789
patch: add index line for diff output This helps highlighting in third-party diff coloring (which assumes git output) and maintains pedantic correctness with diff --git. Tests will be added at the end of the series.
Mon, 09 Jan 2017 11:13:47 -0800 patch: add config knob for displaying the index header
Sean Farley <sean@farley.io> [Mon, 09 Jan 2017 11:13:47 -0800] rev 30788
patch: add config knob for displaying the index header This config knob can take an integer between 0 and 40 or a keyword ('none', 'short', 'full') to control the length of hash to output. It will display diffs with the git index header as such, diff --git a/mercurial/mdiff.py b/mercurial/mdiff.py index 112edf7..d6b52c5 100644 We'll put this in the experimental section for now.
Thu, 12 Jan 2017 12:05:23 -0800 bisect: refer directly to bisect() revset predicate in help
Martin von Zweigbergk <martinvonz@google.com> [Thu, 12 Jan 2017 12:05:23 -0800] rev 30787
bisect: refer directly to bisect() revset predicate in help We have specific syntax for displaying the help text for a particular revset predicate, so let's refer directly to the bisect() revset in the verbose bisect help. It seems likely that the user doesn't care about other revsets at that point, so they will probably not miss the text about the other revset predicates.
Thu, 12 Jan 2017 11:43:26 -0800 tests: use "hg help revisions.<predicate>" instead of grepping
Martin von Zweigbergk <martinvonz@google.com> [Thu, 12 Jan 2017 11:43:26 -0800] rev 30786
tests: use "hg help revisions.<predicate>" instead of grepping We have specific syntax for displaying the help text for a particular revset predicate, so use that instead of grepping through the full output.
Thu, 12 Jan 2017 11:52:05 -0800 help: remove now-redundant pointer to revsets help
Martin von Zweigbergk <martinvonz@google.com> [Thu, 12 Jan 2017 11:52:05 -0800] rev 30785
help: remove now-redundant pointer to revsets help "hg help revisions" and "hg help revsets" now point to the same text, so drop the revsets reference.
Sat, 07 Jan 2017 23:35:35 -0500 help: eliminate duplicate text for revset string patterns
Matt Harbison <matt_harbison@yahoo.com> [Sat, 07 Jan 2017 23:35:35 -0500] rev 30784
help: eliminate duplicate text for revset string patterns There's no reason to duplicate this so many times, and it's likely an instance will be missed if support for a new pattern is added and documented. The stringmatcher is mostly used by revsets, though it is also used for the 'tag' related templates, and namespace filtering in the journal extension. So maybe there's a better place to document it. `hg help patterns` seems inappropriate, because that is all file pattern matching. While here, indicate how to perform case insensitive regex searches.
Sat, 07 Jan 2017 21:26:32 -0500 revset: add regular expression support to 'desc'
Matt Harbison <matt_harbison@yahoo.com> [Sat, 07 Jan 2017 21:26:32 -0500] rev 30783
revset: add regular expression support to 'desc' This is a case insensitive predicate like 'author', so it conforms to the existing behavior of performing a case insensitive regex.
Wed, 11 Jan 2017 22:42:10 -0500 revset: stop lowercasing the regex pattern for 'author'
Matt Harbison <matt_harbison@yahoo.com> [Wed, 11 Jan 2017 22:42:10 -0500] rev 30782
revset: stop lowercasing the regex pattern for 'author' It was probably unintentional for regex, as the meaning of some sequences like \S and \s is actually inverted by changing the case. For backward compatibility however, the matching is forced to case insensitive.
Thu, 24 Nov 2016 18:45:29 -0800 repair: clean up stale lock file from store backup
Gregory Szorc <gregory.szorc@gmail.com> [Thu, 24 Nov 2016 18:45:29 -0800] rev 30781
repair: clean up stale lock file from store backup Since we did a directory rename on the stores, the source repository's lock path now references the dest repository's lock path and the dest repository's lock path now references a non-existent filename. So releasing the lock on the source will unlock the dest and releasing the lock on the dest will no-op because it fails due to file not found. So we clean up the dest's lock manually.
Thu, 24 Nov 2016 18:34:50 -0800 repair: copy non-revlog store files during upgrade
Gregory Szorc <gregory.szorc@gmail.com> [Thu, 24 Nov 2016 18:34:50 -0800] rev 30780
repair: copy non-revlog store files during upgrade The store contains more than just revlogs. This patch teaches the upgrade code to copy regular files as well. As the test changes demonstrate, the phaseroots file is now copied.
Sun, 18 Dec 2016 17:00:15 -0800 repair: migrate revlogs during upgrade
Gregory Szorc <gregory.szorc@gmail.com> [Sun, 18 Dec 2016 17:00:15 -0800] rev 30779
repair: migrate revlogs during upgrade Our next step for in-place upgrade is to migrate store data. Revlogs are the biggest source of data within the store and a store is useless without them, so we implement their migration first. Our strategy for migrating revlogs is to walk the store and call `revlog.clone()` on each revlog. There are some minor complications. Because revlogs have different storage options (e.g. changelog has generaldelta and delta chains disabled), we need to obtain the correct class of revlog so inserted data is encoded properly for its type. Various attempts at implementing progress indicators that didn't lead to frustration from false "it's almost done" indicators were made. I initially used a single progress bar based on number of revlogs. However, this quickly churned through all filelogs, got to 99% then effectively froze at 99.99% when it got to the manifest. So I converted the progress bar to total revision count. This was a little bit better. But the manifest was still significantly slower than filelogs and it took forever to process the last few percent. I then tried both revision/chunk bytes and raw bytes as the denominator. This had the opposite effect: because so much data is in manifests, it would churn through filelogs without showing much progress. When it got to manifests, it would fill in 90+% of the progress bar. I finally gave up having a unified progress bar and instead implemented 3 progress bars: 1 for filelog revisions, 1 for manifest revisions, and 1 for changelog revisions. I added extra messages indicating the total number of revisions of each so users know there are more progress bars coming. I also added extra messages before and after each stage to give extra details about what is happening. Strictly speaking, this isn't necessary. But the numbers are impressive. For example, when converting a non-generaldelta mozilla-central repository, the messages you see are: migrating 2475593 total revisions (1833043 in filelogs, 321156 in manifests, 321394 in changelog) migrating 1.67 GB in store; 2508 GB tracked data migrating 267868 filelogs containing 1833043 revisions (1.09 GB in store; 57.3 GB tracked data) finished migrating 1833043 filelog revisions across 267868 filelogs; change in size: -415776 bytes migrating 1 manifests containing 321156 revisions (518 MB in store; 2451 GB tracked data) That "2508 GB" figure really blew me away. I had no clue that the raw tracked data in mozilla-central was that large. Granted, 2451 GB is in the manifest and "only" 57.3 GB is in filelogs. But still. It's worth noting that gratuitous loading of source revlogs in order to display numbers and progress bars does serve a purpose: it ensures we can open all source revlogs. We don't want to spend several minutes copying revlogs only to encounter a permissions error or similar later. As part of this commit, we also add swapping of the store directory to the upgrade function. After revlogs are converted, we move the old store into the backup directory then move the temporary repo's store into the old store's location. On well-behaved systems, this should be 2 atomic operations and the window of inconsistency show be very narrow. There are still a few improvements to be made to store copying and upgrading. But this commit gets the bulk of the work out of the way.
Sun, 18 Dec 2016 17:02:57 -0800 revlog: add clone method
Gregory Szorc <gregory.szorc@gmail.com> [Sun, 18 Dec 2016 17:02:57 -0800] rev 30778
revlog: add clone method Upcoming patches will introduce functionality for in-place repository/store "upgrades." Copying the contents of a revlog feels sufficiently low-level to warrant being in the revlog class. So this commit implements that functionality. Because full delta recomputation can be *very* expensive (we're talking several hours on the Firefox repository), we support multiple modes of execution with regards to delta (re)use. This will allow repository upgrades to choose the "level" of processing/optimization they wish to perform when converting revlogs. It's not obvious from this commit, but "addrevisioncb" will be used for progress reporting.
Sun, 18 Dec 2016 16:59:04 -0800 repair: begin implementation of in-place upgrading
Gregory Szorc <gregory.szorc@gmail.com> [Sun, 18 Dec 2016 16:59:04 -0800] rev 30777
repair: begin implementation of in-place upgrading Now that all the upgrade planning work is in place, we can start doing the real work: actually upgrading a repository. The main goal of this commit is to get the "framework" for running in-place upgrade actions in place. Rather than get too clever and low-level with regards to in-place upgrades, our strategy is to create a new, temporary repository, copy data to it, then replace the old data with the new. This allows us to reuse a lot of code in localrepo.py around store interaction, which will eventually consume the bulk of the upgrade code. But we have to start small. This patch implements adding new repository requirements. But it still sets up a temporary repository and locks it and the source repo before performing the requirements file swap. This means all the plumbing is in place to implement store copying in subsequent commits.
Sun, 18 Dec 2016 16:51:09 -0800 repair: determine what upgrade will do
Gregory Szorc <gregory.szorc@gmail.com> [Sun, 18 Dec 2016 16:51:09 -0800] rev 30776
repair: determine what upgrade will do This commit introduces code for determining what actions/improvements an upgrade should perform. The "upgradefindimprovements" function introduces a mechanism to return a list of improvements that can be made to a repository. Each improvement is effectively an action that an upgrade will perform. Associated with each of these improvements is metadata that will be used to inform users what's wrong and what an upgrade will do. Each "improvement" is categorized as a "deficiency" or an "optimization." TBH, I'm not thrilled about the terminology and am receptive to constructive bikeshedding. The main difference between a "deficiency" and an "optimization" is a deficiency is always corrected (if it deviates from the current config) and an "optimization" is an optional action that goes above and beyond to improve the state of the repository (usually by requiring more CPU during upgrade). Our initial set of improvements identifies missing repository requirements, a single, easily correctable problem with changelog storage, and a set of "optimizations" related to delta recalculation. The main "upgraderepo" function has been expanded to handle improvements. It queries for the list of improvements and determines which of them will run based on the current repository state and user I went through numerous iterations of the output format before settling on a ReST-inspired definition list format. (I used bulleted lists in the first submission of this commit and could not get it to format just right.) Even with the various iterations, I'm still not super thrilled with the format. But, this is a debug* command, so that should mean we can refine the output without BC concerns.
Sun, 18 Dec 2016 16:16:54 -0800 repair: implement requirements checking for upgrades
Gregory Szorc <gregory.szorc@gmail.com> [Sun, 18 Dec 2016 16:16:54 -0800] rev 30775
repair: implement requirements checking for upgrades This commit introduces functionality for upgrading a repository in place. The first part that's implemented is testing for upgrade "compatibility." This is done by examining repository requirements. There are 5 functions returning sets of requirements that control upgrading. Why so many functions? Mainly to support extensions. Functions are easier to monkeypatch than module variables. Astute readers will see that we don't support "manifestv2" and "treemanifest" requirements in the upgrade mechanism. I don't have a great answer for why other than this is a complex set of patches and I don't want to deal with the complexity of these experimental features just yet. We can teach the upgrade mechanism about them later, once the basic upgrade mechanism is in place. This commit also introduces the "upgraderepo" function. This will be our main routine for performing an in-place upgrade. Currently, it just implements requirements checking. The structure of some code in this function may look a bit weird (e.g. the inline function that is only called once). But this will make sense after future commits.
Thu, 24 Nov 2016 16:24:09 -0800 debugcommands: stub for debugupgraderepo command
Gregory Szorc <gregory.szorc@gmail.com> [Thu, 24 Nov 2016 16:24:09 -0800] rev 30774
debugcommands: stub for debugupgraderepo command Currently, if Mercurial introduces a new repository/store feature or changes behavior of an existing feature, users must perform an `hg clone` to create a new repository with hopefully the correct/optimal settings. Unfortunately, even `hg clone` may not give the correct results. For example, if you do a local `hg clone`, you may get hardlinks to revlog files that inherit the old state. If you `hg clone` from a remote or `hg clone --pull`, changegroup application may bypass some optimization, such as converting to generaldelta. Optimizing a repository is harder than it seems and requires more than a simple `hg` command invocation. This commit starts the process of changing that. We introduce `hg debugupgraderepo`, a command that performs an in-place upgrade of a repository to use new, optimal features. The command is just a stub right now. Features will be added in subsequent commits. This commit does foreshadow some of the behavior of the new command, notably that it doesn't do anything by default and that it takes arguments that influence what actions it performs. These will be explained more in subsequent commits.
Wed, 11 Jan 2017 21:47:19 -0500 util: teach stringmatcher to handle forced case insensitive matches
Matt Harbison <matt_harbison@yahoo.com> [Wed, 11 Jan 2017 21:47:19 -0500] rev 30773
util: teach stringmatcher to handle forced case insensitive matches The 'author' and 'desc' revsets are documented to be case insensitive. Unfortunately, this was implemented in 'author' by forcing the input to lowercase, including for regex like '\B'. (This actually inverts the meaning of the sequence.) For backward compatibility, we will keep that a case insensitive regex, but by using matcher options instead of brute force. This doesn't preclude future hypothetical 'icase-literal:' style prefixes that can be provided by the user. Such user specified cases can probably be handled up front by stripping 'icase-', setting the variable, and letting it drop through the existing code.
Wed, 11 Jan 2017 23:13:51 -0500 revset: point to 'grep' in the 'keyword' help for regex searches
Matt Harbison <matt_harbison@yahoo.com> [Wed, 11 Jan 2017 23:13:51 -0500] rev 30772
revset: point to 'grep' in the 'keyword' help for regex searches The help for 'grep' already points to 'keyword'.
Wed, 11 Jan 2017 23:13:00 -0800 help: explain that revsets can be used where 1 or 2 revs are wanted
Martin von Zweigbergk <martinvonz@google.com> [Wed, 11 Jan 2017 23:13:00 -0800] rev 30771
help: explain that revsets can be used where 1 or 2 revs are wanted We did not seem to document that one can do things like "hg up :@" where the last revision of the revset ":@".
Wed, 11 Jan 2017 22:46:07 -0800 help: explain what the term "revset" means
Martin von Zweigbergk <martinvonz@google.com> [Wed, 11 Jan 2017 22:46:07 -0800] rev 30770
help: explain what the term "revset" means We refer to revsets in a few places (e.g. in "hg help config"), but we never explained what they are. Until now.
Wed, 11 Jan 2017 11:37:38 -0800 help: merge revsets.txt into revisions.txt
Martin von Zweigbergk <martinvonz@google.com> [Wed, 11 Jan 2017 11:37:38 -0800] rev 30769
help: merge revsets.txt into revisions.txt Selecting single and multiple revisions is closely related, so let's put it in one place, so users can easily find it. We actually did not even point to "hg help revsets" from "hg help revisions", but now that they're on a single page, that won't be necessary.
Wed, 11 Jan 2017 11:40:40 -0800 tests: use `hg help dates` instead of `hg help revs` in test
Martin von Zweigbergk <martinvonz@google.com> [Wed, 11 Jan 2017 11:40:40 -0800] rev 30768
tests: use `hg help dates` instead of `hg help revs` in test The revisions help is already long and will get longer, so switch to another short and stable topic.
Wed, 11 Jan 2017 11:28:54 -0800 help: use a single paragraph to describe full and abbreviated nodeids
Martin von Zweigbergk <martinvonz@google.com> [Wed, 11 Jan 2017 11:28:54 -0800] rev 30767
help: use a single paragraph to describe full and abbreviated nodeids The texts describing 40-digit strings and the abbreviated form are closely related, so make it a single paragraph.
Tue, 10 Jan 2017 23:37:08 -0800 hgweb: support Content Security Policy
Gregory Szorc <gregory.szorc@gmail.com> [Tue, 10 Jan 2017 23:37:08 -0800] rev 30766
hgweb: support Content Security Policy Content-Security-Policy (CSP) is a web security feature that allows servers to declare what loaded content is allowed to do. For example, a policy can prevent loading of images, JavaScript, CSS, etc unless the source of that content is whitelisted (by hostname, URI scheme, hashes of content, etc). It's a nifty security feature that provides extra mitigation against some attacks, notably XSS. Mitigation against these attacks is important for Mercurial because hgweb renders repository data, which is commonly untrusted. While we make attempts to escape things, etc, there's the possibility that malicious data could be injected into the site content. If this happens today, the full power of the web browser is available to that malicious content. A restrictive CSP policy (defined by the server operator and sent in an HTTP header which is outside the control of malicious content), could restrict browser capabilities and mitigate security problems posed by malicious data. CSP works by emitting an HTTP header declaring the policy that browsers should apply. Ideally, this header would be emitted by a layer above Mercurial (likely the HTTP server doing the WSGI "proxying"). This works for some CSP policies, but not all. For example, policies to allow inline JavaScript may require setting a "nonce" attribute on <script>. This attribute value must be unique and non-guessable. And, the value must be present in the HTTP header and the HTML body. This means that coordinating the value between Mercurial and another HTTP server could be difficult: it is much easier to generate and emit the nonce in a central location. This commit introduces support for emitting a Content-Security-Policy header from hgweb. A config option defines the header value. If present, the header is emitted. A special "%nonce%" syntax in the value triggers generation of a nonce and inclusion in <script> elements in templates. The inclusion of a nonce does not occur unless "%nonce%" is present. This makes this commit completely backwards compatible and the feature opt-in. The nonce is a type 4 UUID, which is the flavor that is randomly generated. It has 122 random bits, which should be plenty to satisfy the guarantees of a nonce.
Tue, 10 Jan 2017 20:47:48 -0800 hgweb: call process_dates() via DOM event listener
Gregory Szorc <gregory.szorc@gmail.com> [Tue, 10 Jan 2017 20:47:48 -0800] rev 30765
hgweb: call process_dates() via DOM event listener All the hgweb templates include mercurial.js in their header. All the hgweb templates have the same <script> boilerplate to run process_dates(). This patch factors that function call into mercurial.js as part of a DOMContentLoaded event listener.
Sat, 24 Dec 2016 15:29:32 -0700 protocol: send application/mercurial-0.2 responses to capable clients
Gregory Szorc <gregory.szorc@gmail.com> [Sat, 24 Dec 2016 15:29:32 -0700] rev 30764
protocol: send application/mercurial-0.2 responses to capable clients With this commit, the HTTP transport now parses the X-HgProto-<N> header to determine what media type and compression engine to use for responses. So far, we only compress responses that are already being compressed with zlib today (stream response types to specific commands). We can expand things to cover additional response types later. The practical side-effect of this commit is that non-zlib compression engines will be used if both ends support them. This means if both ends have zstd support, zstd - not zlib - will be used to compress data! When cloning the mozilla-unified repository between a local HTTP server and client, the benefits of non-zlib compression are quite noticeable: engine server CPU (s) client CPU (s) bundle size zlib (l=6) 174.1 283.2 1,148,547,026 zstd (l=1) 99.2 267.3 1,127,513,841 zstd (l=3) 103.1 266.9 1,018,861,363 zstd (l=7) 128.3 269.7 919,190,278 zstd (l=10) 162.0 - 894,547,179 none 95.3 277.2 4,097,566,064 The default zstd compression level is 3. So if you deploy zstd capable Mercurial to your clients and servers and CPU time on your server is dominated by "getbundle" requests (clients cloning and pulling) - and my experience at Mozilla tells me this is often the case - this commit could drastically reduce your server-side CPU usage *and* save on bandwidth costs! Another benefit of this change is that server operators can install *any* compression engine. While it isn't enabled by default, the "none" compression engine can now be used to disable wire protocol compression completely. Previously, commands like "getbundle" always zlib compressed output, adding considerable overhead to generating responses. If you are on a high speed network and your server is under high load, it might be advantageous to trade bandwidth for CPU. Although, zstd at level 1 doesn't use that much CPU, so I'm not convinced that disabling compression wholesale is worthwhile. And, my data seems to indicate a slow down on the client without compression. I suspect this is due to a lack of buffering resulting in an increase in socket read() calls and/or the fact we're transferring an extra 3 GB of data (parsing HTTP chunked transfer and processing extra TCP packets can add up). This is definitely worth investigating and optimizing. But since the "none" compressor isn't enabled by default, I'm inclined to punt on this issue. This commit introduces tons of tests. Some of these should arguably have been implemented on previous commits. But it was difficult to test without the server functionality in place.
Sat, 24 Dec 2016 15:22:18 -0700 httppeer: advertise and support application/mercurial-0.2
Gregory Szorc <gregory.szorc@gmail.com> [Sat, 24 Dec 2016 15:22:18 -0700] rev 30763
httppeer: advertise and support application/mercurial-0.2 Now that servers expose a capability indicating they support application/mercurial-0.2 and compression, clients can key off this to say they support responses that are compressed with various compression formats. After this commit, the HTTP wire protocol client now sends an "X-HgProto-<N>" request header indicating its support for "application/mercurial-0.2" media type and various compression formats. This commit also implements support for handling "application/mercurial-0.2" responses. It simply reads the header compression engine identifier then routes the remainder of the response to the appropriate decompressor. There were some test changes, but only to logging. That points to an obvious gap in our test coverage. This will be addressed in a subsequent commit once server support is in place (it is hard to test without server support).
Sat, 24 Dec 2016 15:21:46 -0700 wireproto: advertise supported media types and compression formats
Gregory Szorc <gregory.szorc@gmail.com> [Sat, 24 Dec 2016 15:21:46 -0700] rev 30762
wireproto: advertise supported media types and compression formats This commit introduces support for advertising a server's support for media types and compression formats in accordance with the spec defined in internals.wireproto. The bulk of the new code is a helper function in wireproto.py to obtain a prioritized list of compression engines available to the wire protocol. While not utilized yet, we implement support for obtaining the list of compression engines advertised by the client. The upcoming HTTP protocol enhancements are a bit lower-level than existing tests (most existing tests are command centric). So, this commit establishes a new test file that will be appropriate for holding tests around the functionality of the HTTP protocol itself. Rounding out this change, `hg debuginstall` now prints compression engines available to the server.
(0) -30000 -10000 -3000 -1000 -300 -100 -56 +56 +100 +300 +1000 +3000 +10000 tip