match: sort patterns before compiling them into a regex
While investigating cripping performance for `hg cat` in some context, I
discovered that, for large inputs, building a regex from out of order patterns
result may result in a *much* slower regex and a much slower associated
matcher's performance.
So we are now sorting the patterns to help the regex engine.
There is more to the story as we rely on regexp more than we should. See the
next changeset for details.
Benchmarks
==========
In the following benchmark we are comparing the `hg cat` and `hg files` run
time when matching against the full list of files in the repository. They are
run:
- without the rust extensions
- with the standard python enfine (so without re2)
sort vs non-sorted - Before this changeset (
3f5137543773)
---------------------------------------------------------
###### hg files ###############################################################
### mercurial-2018-08-01-zstd-sparse-revlog
sorted: 0.230092 seconds
shuffled: 0.234235 seconds (+1.80%)
### pypy-2018-08-01-zstd-sparse-revlog
sorted: 0.613567 seconds
shuffled: 0.801880 seconds (+30.69%)
### mozilla-central-2018-08-01-zstd-sparse-revlog
sorted: 62.474221 seconds
shuffled: 1364.180218 seconds (+2083.59%)
### netbeans-2018-08-01-zstd-sparse-revlog
sorted: 21.541828 seconds
shuffled: 172.759857 seconds (+701.97%)
###### hg cat #################################################################
### mercurial-2018-08-01-zstd-sparse-revlog
sorted: 0.764407 seconds
shuffled: 0.768924 seconds
### pypy-2018-08-01-zstd-sparse-revlog
sorted: 2.065220 seconds
shuffled: 2.276388 seconds (+10.22%)
### netbeans-2018-08-01-zstd-sparse-revlog
sorted: 40.967983 seconds
shuffled: 216.388709 seconds (+428.19%)
### mozilla-central-2018-08-01-zstd-sparse-revlog
sorted: 105.228510 seconds
shuffled: 1448.722784 seconds (+1276.74%)
sort vs non-sorted - With this changeset
----------------------------------------
###### hg files ###############################################################
### mercurial-2018-08-01-zstd-sparse-revlog
all-list-pattern-sorted: 0.230069
all-list-pattern-shuffled: 0.231165
### pypy-2018-08-01-zstd-sparse-revlog
all-list-pattern-sorted: 0.616799
all-list-pattern-shuffled: 0.616393
### netbeans-2018-08-01-zstd-sparse-revlog
all-list-pattern-sorted: 21.586773
all-list-pattern-shuffled: 21.908197
### mozilla-central-2018-08-01-zstd-sparse-revlog
all-list-pattern-sorted: 61.279490
all-list-pattern-shuffled: 62.473549
###### hg cat #################################################################
### mercurial-2018-08-01-zstd-sparse-revlog
sorted: 0.763883 seconds
shuffled: 0.765848 seconds
### pypy-2018-08-01-zstd-sparse-revlog
sorted: 2.070498 seconds
shuffled: 2.069197 seconds
### netbeans-2018-08-01-zstd-sparse-revlog
sorted: 41.392423 seconds
shuffled: 41.648689 seconds
### mozilla-central-2018-08-01-zstd-sparse-revlog
sorted: 103.315670 seconds
shuffled: 104.369358 seconds
--- a/mercurial/match.py Mon Mar 27 17:30:14 2023 -0400
+++ b/mercurial/match.py Sat Apr 01 05:57:09 2023 +0200
@@ -197,7 +197,7 @@
Usually a patternmatcher is returned:
>>> _match(b'/foo', b'.', [br're:.*\.c$', b'path:foo/a', b'*.py'])
- <patternmatcher patterns='.*\\.c$|foo/a(?:/|$)|[^/]*\\.py$'>
+ <patternmatcher patterns='[^/]*\\.py$|foo/a(?:/|$)|.*\\.c$'>
Combining 'patterns' with 'include' (resp. 'exclude') gives an
intersectionmatcher (resp. a differencematcher):
@@ -614,7 +614,7 @@
True
>>> m.files()
- ['', 'foo/a', 'b', '']
+ [b'', b'foo/a', b'', b'b']
>>> m.exact(b'foo/a')
True
>>> m.exact(b'b')
@@ -625,6 +625,7 @@
def __init__(self, root, kindpats, badfn=None):
super(patternmatcher, self).__init__(badfn)
+ kindpats.sort()
self._files = _explicitfiles(kindpats)
self._prefix = _prefix(kindpats)
--- a/tests/test-fileset.t Mon Mar 27 17:30:14 2023 -0400
+++ b/tests/test-fileset.t Sat Apr 01 05:57:09 2023 +0200
@@ -53,7 +53,7 @@
(symbol 'glob')
(symbol 'b?')))
* matcher:
- <patternmatcher patterns='a1(?:/|$)|b.$'>
+ <patternmatcher patterns='b.$|a1(?:/|$)'>
a1
b1
b2
--- a/tests/test-walk.t Mon Mar 27 17:30:14 2023 -0400
+++ b/tests/test-walk.t Sat Apr 01 05:57:09 2023 +0200
@@ -536,7 +536,7 @@
f mammals/Procyonidae/raccoon mammals/Procyonidae/raccoon
$ hg debugwalk -v beans 'glob:beans/*'
* matcher:
- <patternmatcher patterns='beans(?:/|$)|beans/[^/]*$'>
+ <patternmatcher patterns='beans/[^/]*$|beans(?:/|$)'>
f beans/black beans/black
f beans/borlotti beans/borlotti
f beans/kidney beans/kidney