revlog: use an LRU cache for delta chain bases
Profiling using statprof revealed a hotspot during changegroup
application calculating delta chain bases on generaldelta repos.
Essentially, revlog._addrevision() was performing a lot of redundant
work tracing the delta chain as part of determining when the chain
distance was acceptable. This was most pronounced when adding
revisions to manifests, which can have delta chains thousands of
revisions long.
There was a delta chain base cache on revlogs before, but it only
captured a single revision. This was acceptable before generaldelta,
when _addrevision would build deltas from the previous revision and
thus we'd pretty much guarantee a cache hit when resolving the delta
chain base on a subsequent _addrevision call. However, it isn't
suitable for generaldelta because parent revisions aren't necessarily
the last processed revision.
This patch converts the delta chain base cache to an LRU dict cache.
The cache can hold multiple entries, so generaldelta repos have a
higher chance of getting a cache hit.
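To illustrate why a multi-entry cache helps, here is a minimal, self-contained sketch. It is not the revlog code: `LRUDict`, `chainbase`, `simulate` and the `deltaparent` list are hypothetical stand-ins. The workload is an assumption chosen to mimic generaldelta ordering: two heads are appended in alternation, and each revision is stored as a delta against the previous revision of its own head.

```python
from collections import OrderedDict


class LRUDict(object):
    """Tiny least-recently-used mapping with a fixed capacity."""

    def __init__(self, capacity):
        self._capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        # Re-insert the entry so it becomes the most recently used one.
        value = self._data.pop(key)
        self._data[key] = value
        return value

    def put(self, key, value):
        self._data.pop(key, None)
        self._data[key] = value
        if len(self._data) > self._capacity:
            # Evict the least recently used entry.
            self._data.popitem(last=False)


def chainbase(deltaparent, rev, cache, stats):
    """Return the base of rev's delta chain, consulting the cache first.

    deltaparent[r] is the revision that r is stored as a delta against.
    deltaparent[r] == r marks a full snapshot, i.e. a chain base.
    """
    cached = cache.get(rev)
    if cached is not None:
        stats['hits'] += 1
        return cached
    stats['misses'] += 1
    start = rev
    base = deltaparent[rev]
    while base != rev:
        # Each iteration is one index lookup the cache could have avoided.
        stats['steps'] += 1
        rev = base
        base = deltaparent[rev]
    cache.put(start, base)
    return base


def simulate(numrevs, capacity):
    """Append revisions that alternate between two heads (assumed workload)."""
    deltaparent = [0]  # rev 0 is a full snapshot and its own chain base
    cache = LRUDict(capacity)
    stats = {'hits': 0, 'misses': 0, 'steps': 0}
    for rev in range(1, numrevs):
        # The delta parent is the previous revision on the same head,
        # which is two positions back in revlog order.
        parent = max(rev - 2, 0)
        base = chainbase(deltaparent, parent, cache, stats)
        deltaparent.append(parent)
        # Remember the base of the revision that was just added.
        cache.put(rev, base)
    return stats


if __name__ == '__main__':
    print('single entry: %r' % simulate(1000, 1))
    print('LRU of 100:   %r' % simulate(1000, 100))
```

With this workload, a capacity of 1 never hits, because the entry for the delta parent is always evicted by the revision added in between. A capacity of 100 misses only on the very first lookup.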
The impact of this change when processing changegroup additions is
significant. I tested on a generaldelta conversion of the
"mozilla-unified" repo, which contains heads of the main Firefox
repositories in chronological order, so there are lots of transitions
between heads in revlog order. Performing an `hg unbundle` of an
uncompressed bundle of that repo:
before: 5:42 CPU time
after: 4:34 CPU time
Most of this time is saved when applying the changelog and manifest
revlogs:
before: 2:30 CPU time
after: 1:17 CPU time
That's nearly a 50% reduction in CPU time applying changesets and
manifests!
Applying a gzipped bundle of the same repo (effectively simulating a
`hg clone` over HTTP) showed a similar speedup:
before: 5:53 CPU time
after: 4:46 CPU time
Wall time improvements were basically the same as CPU time.
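The percentages behind these figures can be checked with a few lines of arithmetic. The helper names `seconds` and `reduction` are made up for this sketch; the timings are the ones quoted in this message.

```python
def seconds(mmss):
    """Convert a 'minutes:seconds' string into a number of seconds."""
    minutes, secs = mmss.split(':')
    return int(minutes) * 60 + int(secs)


def reduction(before, after):
    """Return the percentage of CPU time saved going from before to after."""
    b = seconds(before)
    a = seconds(after)
    return 100.0 * (b - a) / b


measurements = [
    ('unbundle, uncompressed', '5:42', '4:34'),
    ('changelog + manifests', '2:30', '1:17'),
    ('unbundle, gzipped', '5:53', '4:46'),
]

if __name__ == '__main__':
    for label, before, after in measurements:
        print('%-24s %5.1f%%' % (label, reduction(before, after)))
```

This works out to roughly 20%, 49% and 19% respectively.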
I didn't measure explicitly, but it feels like most of the time
is saved when processing manifests. This makes sense, as large
manifests tend to have very long delta chains and thus benefit the
most from this cache.
So, this change effectively makes changegroup application (which is
used by `hg unbundle`, `hg clone`, `hg pull`, `hg unshelve`, and
various other commands) significantly faster when delta chains are
long (which can happen on repos with large numbers of files and thus
large manifests).
In theory, this change can result in more memory utilization. However,
we're caching a dict of ints. At most we have 200 ints + Python object
overhead per revlog. And the cache is really only populated when
performing read-heavy operations, such as adding changegroups or
scanning an individual revlog. For memory bloat to be an issue, we'd
need to scan/read several revisions from several revlogs while holding
active references to those revlogs. Few operations do this, so I don't
expect memory bloat from the cache to be an issue.
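For a rough sense of scale, the following sketch estimates the footprint of one full cache. It assumes 100 entries (an int key and an int value each, matching the "200 ints" above) and uses a plain dict, so it ignores whatever bookkeeping the real LRU structure adds. `cachefootprint` is a hypothetical helper, and `sys.getsizeof` figures are specific to the Python implementation and version.

```python
import sys


def cachefootprint(entries=100):
    """Estimate the bytes used by a dict of int -> int with that many entries."""
    # Use large revision numbers so every int is counted as its own object,
    # which makes this an upper bound rather than an exact figure.
    cache = dict((rev, rev // 2) for rev in range(100000, 100000 + entries))
    size = sys.getsizeof(cache)
    for key, value in cache.items():
        size += sys.getsizeof(key) + sys.getsizeof(value)
    return size


if __name__ == '__main__':
    print('approximate bytes per revlog cache: %d' % cachefootprint())
```

On a typical CPython build this should come to a few tens of kilobytes at most per revlog.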
"""strip changesets and their descendants from history
This extension allows you to strip changesets and all their descendants from the
repository. See the command help for details.
"""
from __future__ import absolute_import
from mercurial.i18n import _
from mercurial import (
    bookmarks as bookmarksmod,
    cmdutil,
    error,
    hg,
    lock as lockmod,
    merge,
    node as nodemod,
    repair,
    scmutil,
    util,
)
nullid = nodemod.nullid
release = lockmod.release
cmdtable = {}
command = cmdutil.command(cmdtable)
# Note for extension authors: ONLY specify testedwith = 'internal' for
# extensions which SHIP WITH MERCURIAL. Non-mainline extensions should
# be specifying the version(s) of Mercurial they are tested with, or
# leave the attribute unspecified.
testedwith = 'internal'
def checksubstate(repo, baserev=None):
    '''return list of subrepos at a different revision than substate.
    Abort if any subrepos have uncommitted changes.'''
    inclsubs = []
    wctx = repo[None]
    if baserev:
        bctx = repo[baserev]
    else:
        bctx = wctx.parents()[0]
    for s in sorted(wctx.substate):
        wctx.sub(s).bailifchanged(True)
        if s not in bctx.substate or bctx.sub(s).dirty():
            inclsubs.append(s)
    return inclsubs
def checklocalchanges(repo, force=False, excsuffix=''):
    '''abort on an unfinished operation or, unless force is set, on
    uncommitted changes in the working directory or its subrepos.

    Return the repository status.'''
    cmdutil.checkunfinished(repo)
    s = repo.status()
    if not force:
        if s.modified or s.added or s.removed or s.deleted:
            _("local changes found") # i18n tool detection
            raise error.Abort(_("local changes found" + excsuffix))
        if checksubstate(repo):
            _("local changed subrepos found") # i18n tool detection
            raise error.Abort(_("local changed subrepos found" + excsuffix))
    return s
def strip(ui, repo, revs, update=True, backup=True, force=None, bookmarks=None):
    '''strip revs and their descendants, then delete the given bookmarks.

    If update is set, first check for local changes and update the
    working directory to a parent of revs[0].'''
    wlock = lock = None
    try:
        wlock = repo.wlock()
        lock = repo.lock()

        if update:
            checklocalchanges(repo, force=force)
            urev, p2 = repo.changelog.parents(revs[0])
            if (util.safehasattr(repo, 'mq') and
                p2 != nullid
                and p2 in [x.node for x in repo.mq.applied]):
                urev = p2
            hg.clean(repo, urev)
            repo.dirstate.write(repo.currenttransaction())

        repair.strip(ui, repo, revs, backup)

        repomarks = repo._bookmarks
        if bookmarks:
            with repo.transaction('strip') as tr:
                if repo._activebookmark in bookmarks:
                    bookmarksmod.deactivate(repo)
                for bookmark in bookmarks:
                    del repomarks[bookmark]
                repomarks.recordchange(tr)
            for bookmark in sorted(bookmarks):
                ui.write(_("bookmark '%s' deleted\n") % bookmark)
    finally:
        release(lock, wlock)
@command("strip",
         [
          ('r', 'rev', [], _('strip specified revision (optional, '
                               'can specify revisions without this '
                               'option)'), _('REV')),
          ('f', 'force', None, _('force removal of changesets, discard '
                                 'uncommitted changes (no backup)')),
          ('', 'no-backup', None, _('no backups')),
          ('', 'nobackup', None, _('no backups (DEPRECATED)')),
          ('n', '', None, _('ignored (DEPRECATED)')),
          ('k', 'keep', None, _("do not modify working directory during "
                                "strip")),
          ('B', 'bookmark', [], _("remove revs only reachable from given"
                                  " bookmark"))],
          _('hg strip [-k] [-f] [-B bookmark] [-r] REV...'))
def stripcmd(ui, repo, *revs, **opts):
    """strip changesets and all their descendants from the repository

    The strip command removes the specified changesets and all their
    descendants. If the working directory has uncommitted changes, the
    operation is aborted unless the --force flag is supplied, in which
    case changes will be discarded.

    If a parent of the working directory is stripped, then the working
    directory will automatically be updated to the most recent
    available ancestor of the stripped parent after the operation
    completes.

    Any stripped changesets are stored in ``.hg/strip-backup`` as a
    bundle (see :hg:`help bundle` and :hg:`help unbundle`). They can
    be restored by running :hg:`unbundle .hg/strip-backup/BUNDLE`,
    where BUNDLE is the bundle file created by the strip. Note that
    the local revision numbers will in general be different after the
    restore.

    Use the --no-backup option to discard the backup bundle once the
    operation completes.

    Strip is not a history-rewriting operation and can be used on
    changesets in the public phase. But if the stripped changesets have
    been pushed to a remote repository you will likely pull them again.

    Return 0 on success.
    """
    backup = True
    if opts.get('no_backup') or opts.get('nobackup'):
        backup = False

    cl = repo.changelog
    revs = list(revs) + opts.get('rev')
    revs = set(scmutil.revrange(repo, revs))

    with repo.wlock():
        bookmarks = set(opts.get('bookmark'))
        if bookmarks:
            repomarks = repo._bookmarks
            if not bookmarks.issubset(repomarks):
                raise error.Abort(_("bookmark '%s' not found") %
                    ','.join(sorted(bookmarks - set(repomarks.keys()))))

            # If the requested bookmark is not the only one pointing to
            # a revision, we have to only delete the bookmark and not
            # strip anything. revsets cannot detect that case.
            nodetobookmarks = {}
            for mark, node in repomarks.iteritems():
                nodetobookmarks.setdefault(node, []).append(mark)
            for marks in nodetobookmarks.values():
                if bookmarks.issuperset(marks):
                    rsrevs = repair.stripbmrevset(repo, marks[0])
                    revs.update(set(rsrevs))
            if not revs:
                lock = tr = None
                try:
                    lock = repo.lock()
                    tr = repo.transaction('bookmark')
                    for bookmark in bookmarks:
                        del repomarks[bookmark]
                    repomarks.recordchange(tr)
                    tr.close()
                    for bookmark in sorted(bookmarks):
                        ui.write(_("bookmark '%s' deleted\n") % bookmark)
                finally:
                    release(lock, tr)

        if not revs:
            raise error.Abort(_('empty revision set'))

        descendants = set(cl.descendants(revs))
        strippedrevs = revs.union(descendants)
        roots = revs.difference(descendants)

        update = False
        # if one of the wdir parents is stripped we'll need
        # to update away to an earlier revision
        for p in repo.dirstate.parents():
            if p != nullid and cl.rev(p) in strippedrevs:
                update = True
                break

        rootnodes = set(cl.node(r) for r in roots)

        q = getattr(repo, 'mq', None)
        if q is not None and q.applied:
            # refresh queue state if we're about to strip
            # applied patches
            if cl.rev(repo.lookup('qtip')) in strippedrevs:
                q.applieddirty = True
                start = 0
                end = len(q.applied)
                for i, statusentry in enumerate(q.applied):
                    if statusentry.node in rootnodes:
                        # if one of the stripped roots is an applied
                        # patch, only part of the queue is stripped
                        start = i
                        break
                del q.applied[start:end]
                q.savedirty()

        revs = sorted(rootnodes)
        if update and opts.get('keep'):
            urev, p2 = repo.changelog.parents(revs[0])
            if (util.safehasattr(repo, 'mq') and p2 != nullid
                and p2 in [x.node for x in repo.mq.applied]):
                urev = p2
            uctx = repo[urev]

            # only reset the dirstate for files that would actually change
            # between the working context and uctx
            descendantrevs = repo.revs("%s::." % uctx.rev())
            changedfiles = []
            for rev in descendantrevs:
                # blindly reset the files, regardless of what actually changed
                changedfiles.extend(repo[rev].files())

            # reset files that only changed in the dirstate too
            dirstate = repo.dirstate
            dirchanges = [f for f in dirstate if dirstate[f] != 'n']
            changedfiles.extend(dirchanges)

            repo.dirstate.rebuild(urev, uctx.manifest(), changedfiles)
            repo.dirstate.write(repo.currenttransaction())

            # clear resolve state
            merge.mergestate.clean(repo, repo['.'].node())

            update = False

        strip(ui, repo, revs, backup=backup, update=update,
              force=opts.get('force'), bookmarks=bookmarks)

    return 0