revlog: use an LRU cache for delta chain bases
Profiling using statprof revealed a hotspot during changegroup
application calculating delta chain bases on generaldelta repos.
Essentially, revlog._addrevision() was performing a lot of redundant
work tracing the delta chain as part of determining when the chain
distance was acceptable. This was most pronounced when adding
revisions to manifests, which can have delta chains thousands of
revisions long.
There was a delta chain base cache on revlogs before, but it only
captured a single revision. This was acceptable before generaldelta,
when _addrevision would build deltas from the previous revision and
thus we'd pretty much guarantee a cache hit when resolving the delta
chain base on a subsequent _addrevision call. However, it isn't
suitable for generaldelta because parent revisions aren't necessarily
the last processed revision.
This patch converts the delta chain base cache to an LRU dict cache.
The cache can hold multiple entries, so generaldelta repos have a
higher chance of getting a cache hit.
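
As a rough illustration of the idea (a toy model, not Mercurial's actual implementation), the sketch below represents a revlog index as a list of delta-base revisions and memoizes chain base lookups in a small LRU dict. The names `LRUDict`, `ToyRevlog`, `chainbase`, and `totalsteps` are hypothetical, and the interleaved two-chain layout is an assumption meant to mimic switching between heads in revlog order.

```python
from collections import OrderedDict


class LRUDict(object):
    """Tiny LRU mapping (hypothetical helper, not Mercurial's own class)."""

    def __init__(self, capacity):
        self._capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        # Re-insert the key so it becomes the most recently used entry.
        value = self._data.pop(key)
        self._data[key] = value
        return value

    def put(self, key, value):
        self._data.pop(key, None)
        self._data[key] = value
        if len(self._data) > self._capacity:
            # Evict the least recently used entry (the oldest one).
            self._data.popitem(last=False)


class ToyRevlog(object):
    """Toy model: deltabase[rev] is the revision that rev is a delta against.

    A revision that is its own delta base is a full snapshot, i.e. the
    base of its delta chain.
    """

    def __init__(self, deltabase, cachesize=200):
        self._deltabase = deltabase
        self._cache = LRUDict(cachesize)
        self.steps = 0  # chain links followed, tracked for illustration

    def chainbase(self, rev):
        cur = rev
        while True:
            cached = self._cache.get(cur)
            if cached is not None:
                base = cached
                break
            if self._deltabase[cur] == cur:
                base = cur
                break
            cur = self._deltabase[cur]
            self.steps += 1
        # Remember the answer for the revision that was asked about.
        self._cache.put(rev, base)
        return base


def totalsteps(cachesize, nrevs=1000):
    # Two interleaved chains: even revisions delta against rev - 2 down to
    # rev 0, odd revisions down to rev 1.
    deltabase = [0, 1] + [rev - 2 for rev in range(2, nrevs)]
    rl = ToyRevlog(deltabase, cachesize)
    for rev in range(nrevs):
        rl.chainbase(rev)
    return rl.steps


print(totalsteps(cachesize=1))    # single-entry cache: 249500
print(totalsteps(cachesize=200))  # LRU cache: 998
```

In this model a single-entry cache never hits, because the one remembered revision always belongs to the other chain, so every lookup walks its chain to the root. With room for multiple entries, each lookup stops after one link.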
The impact of this change when processing changegroup additions is
significant. On a generaldelta conversion of the "mozilla-unified"
repo (which contains heads of the main Firefox repositories in
chronological order - this means there are lots of transitions between
heads in revlog order), this change has the following impact when
performing an `hg unbundle` of an uncompressed bundle of the repo:
before: 5:42 CPU time
after: 4:34 CPU time
Most of this time is saved when applying the changelog and manifest
revlogs:
before: 2:30 CPU time
after: 1:17 CPU time
That's nearly a 50% reduction in CPU time applying changesets and
manifests!
Applying a gzipped bundle of the same repo (effectively simulating a
`hg clone` over HTTP) showed a similar speedup:
before: 5:53 CPU time
after: 4:46 CPU time
Wall time improvements were basically the same as CPU time.
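
For reference, the relative savings implied by the CPU timings quoted in this message can be worked out with a few lines of Python. The helper names `seconds` and `reduction` are hypothetical; the input values are copied from the measurements in this message.

```python
def seconds(mmss):
    """Convert an 'M:SS' string to a number of seconds."""
    minutes, secs = mmss.split(':')
    return int(minutes) * 60 + int(secs)


def reduction(before, after):
    """Percentage of CPU time saved going from before to after."""
    b = seconds(before)
    return 100.0 * (b - seconds(after)) / b


timings = [
    ('unbundle, uncompressed (total)', '5:42', '4:34'),
    ('changelog + manifests only', '2:30', '1:17'),
    ('unbundle, gzipped (total)', '5:53', '4:46'),
]
for label, before, after in timings:
    print('%-32s %5.1f%%' % (label, reduction(before, after)))
# unbundle, uncompressed (total)    19.9%
# changelog + manifests only        48.7%
# unbundle, gzipped (total)         19.0%
```

The second row is the "nearly 50%" figure; the end-to-end savings for both bundle types are close to 20%.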
I didn't measure explicitly, but it feels like most of the time
is saved when processing manifests. This makes sense, as large
manifests tend to have very long delta chains and thus benefit the
most from this cache.
So, this change effectively makes changegroup application (which is
used by `hg unbundle`, `hg clone`, `hg pull`, `hg unshelve`, and
various other commands) significantly faster when delta chains are
long (which can happen on repos with large numbers of files and thus
large manifests).
In theory, this change can result in more memory utilization. However,
we're caching a dict of ints. At most we have 200 ints + Python object
overhead per revlog. And, the cache is really only populated when
performing read-heavy operations, such as adding changegroups or
scanning an individual revlog. For memory bloat to be an issue, we'd
need to scan/read several revisions from several revlogs all while
having active references to several revlogs. I don't think there are
many operations that do this, so I don't think memory bloat from the
cache will be an issue.
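
To put a rough number on the per-revlog cost, the following sketch sums `sys.getsizeof` over a plain dict holding 200 int-to-int entries. This is an assumption-laden stand-in: a real LRU structure adds per-entry bookkeeping on top, and the result varies with the Python version and platform. The function name `approxcachebytes` and the starting revision number are hypothetical.

```python
import sys


def approxcachebytes(entries=200, firstrev=100000):
    """Rough size in bytes of a dict mapping `entries` ints to ints."""
    cache = dict((rev, rev - 1)
                 for rev in range(firstrev, firstrev + entries))
    total = sys.getsizeof(cache)
    for key, value in cache.items():
        # Count each key and value object separately.
        total += sys.getsizeof(key) + sys.getsizeof(value)
    return total


print('%.1f KiB' % (approxcachebytes() / 1024.0))
```

On typical CPython builds this comes out in the low tens of KiB, which supports the argument that a full cache on a handful of live revlogs is small.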
from __future__ import absolute_import, print_function
import pprint
from mercurial import (
    minirst,
)

def debugformat(text, form, **kwargs):
    # form is either the string 'html' or an integer column width.
    if form == 'html':
        print("html format:")
        out = minirst.format(text, style=form, **kwargs)
    else:
        print("%d column format:" % form)
        out = minirst.format(text, width=form, **kwargs)

    print("-" * 70)
    # A tuple result carries extra data in its second element; print the
    # formatted text first, then pretty-print that extra data.
    if isinstance(out, tuple):
        print(out[0][:-1])
        print("-" * 70)
        pprint.pprint(out[1])
    else:
        print(out[:-1])
    print("-" * 70)
    print()

def debugformats(title, text, **kwargs):
    # Render the same input at two widths and as HTML.
    print("== %s ==" % title)
    debugformat(text, 60, **kwargs)
    debugformat(text, 30, **kwargs)
    debugformat(text, 'html', **kwargs)

paragraphs = """
This is some text in the first paragraph.
A small indented paragraph.
It is followed by some lines
containing random whitespace.
\n \n \nThe third and final paragraph.
"""
debugformats('paragraphs', paragraphs)
definitions = """
A Term
Definition. The indented
lines make up the definition.
Another Term
Another definition. The final line in the
definition determines the indentation, so
this will be indented with four spaces.
A Nested/Indented Term
Definition.
"""
debugformats('definitions', definitions)
literals = r"""
The fully minimized form is the most
convenient form::
Hello
literal
world
In the partially minimized form a paragraph
simply ends with space-double-colon. ::
////////////////////////////////////////
long un-wrapped line in a literal block
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
::
This literal block is started with '::',
the so-called expanded form. The paragraph
with '::' disappears in the final output.
"""
debugformats('literals', literals)
lists = """
- This is the first list item.
Second paragraph in the first list item.
- List items need not be separated
by a blank line.
- And will be rendered without
one in any case.
We can have indented lists:
- This is an indented list item
- Another indented list item::
- A literal block in the middle
of an indented list.
(The above is not a list item since we are in the literal block.)
::
Literal block with no indentation (apart from
the two spaces added to all literal blocks).
1. This is an enumerated list (first item).
2. Continuing with the second item.
(1) foo
(2) bar
1) Another
2) List
Line blocks are also a form of list:
| This is the first line.
The line continues here.
| This is the second line.
"""
debugformats('lists', lists)
options = """
There is support for simple option lists,
but only with long options:
-X, --exclude filter an option with a short and long option with an argument
-I, --include an option with both a short option and a long option
--all Output all.
--both Output both (this description is
quite long).
--long Output all day long.
--par This option has two paragraphs in its description.
This is the first.
This is the second. Blank lines may be omitted between
options (as above) or left in (as here).
The next paragraph looks like an option list, but lacks the two-space
marker after the option. It is treated as a normal paragraph:
--foo bar baz
"""
debugformats('options', options)
fields = """
:a: First item.
:ab: Second item. Indentation and wrapping
is handled automatically.
Next list:
:small: The larger key below triggers full indentation here.
:much too large: This key is big enough to get its own line.
"""
debugformats('fields', fields)
containers = """
Normal output.
.. container:: debug
Initial debug output.
.. container:: verbose
Verbose output.
.. container:: debug
Debug output.
"""
debugformats('containers (normal)', containers)
debugformats('containers (verbose)', containers, keep=['verbose'])
debugformats('containers (debug)', containers, keep=['debug'])
debugformats('containers (verbose debug)', containers,
             keep=['verbose', 'debug'])
roles = """Please see :hg:`add`."""
debugformats('roles', roles)
sections = """
Title
=====
Section
-------
Subsection
''''''''''
Markup: ``foo`` and :hg:`help`
------------------------------
"""
debugformats('sections', sections)
admonitions = """
.. note::
This is a note
- Bullet 1
- Bullet 2
.. warning:: This is a warning Second
input line of warning
.. danger::
This is danger
"""
debugformats('admonitions', admonitions)
comments = """
Some text.
.. A comment
.. An indented comment
Some indented text.
..
Empty comment above
"""
debugformats('comments', comments)
data = [['a', 'b', 'c'],
        ['1', '2', '3'],
        ['foo', 'bar', 'baz this list is very very very long man']]
rst = minirst.maketable(data, 2, True)
table = ''.join(rst)
print(table)
debugformats('table', table)
data = [['s', 'long', 'line\ngoes on here'],
        ['', 'xy', 'tried to fix here\n by indenting']]
rst = minirst.maketable(data, 1, False)
table = ''.join(rst)
print(table)
debugformats('table+nl', table)