view mercurial/hbisect.py @ 39506:b66ea3fc3a86

sparse-revlog: set max delta chain length to one thousand

The new snapshot system used in the sparse-revlog case has given us a small size benefit so far. However, its most important property is that it gracefully handles a harder limit on delta chain length.

Long delta chains have a very detrimental impact on read (and write) performance in revlog. Being able to shorten them provides a great boost. However, shortening delta chains used to result in a significantly lower compression ratio. The intermediate snapshots effectively suppress most of this effect (all of it in some cases).

# Effect on the test repository

The repository we use for tests is not "realistic" but can still show this in action using an unreasonably low chain limit.

Limiting the chain length shows a sizeable increase, but it stays under control: +6% for limit=15; +15% for limit=10.

Without the snapshot system the increase is significantly bigger: +45% for limit=15; +80% for limit=10.

Although slightly larger than without a delta chain limit, the resulting size is still smaller than before we started doing snapshots.

Here is a table for comparison. (Since the repository is not branchy, the initial sparse-revlog version does not bring much benefit compared to the non-sparse one.)

chain length limit    |       none |    limit=15 |    limit=10 |
without sparse-revlog | 62 818 987 | 112 664 615 | 131 222 574 |
without snapshot      | 74 365 490 | 108 211 410 | 133 857 764 |
with snapshot         | 59 230 936 |  63 002 924 |  68 415 329 |

# Effect On Real Life Repositories

The series provides significant benefits on all kinds of repositories.

Using `hg debugupgraderepo -o redeltaparent --run`, we recomputed delta chains for various repositories with different settings:

- delta chain length: unlimited or 1000 limit
- sparse-revlog: enabled or disabled
- this series: applied or not applied

We can observe multiple types of effect:

- On very branchy repositories:
  * The delta chain limit has a low impact on the repo size.
  * Intermediate snapshots greatly reduce manifest size:
    - pypy: -80%
    - netbeans: -95%
  * The delta chain limit is effective, without a size impact:
    - netbeans average: 613 -> 282
    - private #1 average: 1 068 -> 307

- On more linear repositories:
  * Intermediate snapshots limit the impact of the delta chain limit:
    - mozilla: without the series: +360%; with the series: +25%
  * The delta chain limit provides a large improvement:
    - mozilla's average chain length: unlimited: 15 338; limited: 469
  * Despite the chain length limit, the manifest size is reduced:
    - mercurial: -25%
    - mozilla: -30%

It is clear that the use of chains of intermediate snapshots provides large benefits both in storage size and in delta chain quality. We should now switch our effort toward making sure the write performance is acceptable. Then, `sparse-revlog` will be a suitable format for all new repositories.

# Raw Statistic

* no-sparse: general delta repository not using sparse-revlog
* no-snapshot: sparse-revlog repository not using this series
* snapshot: sparse-revlog repository using this series

mercurial

Manifest Size:

limit       | none        | 1000
------------|-------------|------------
no-sparse   | 8 021 373   | 8 199 366
no-snapshot | 8 103 561   | 8 259 719
snapshot    | 6 137 116   | 6 126 433

Manifest Chain length data

limit       || none              || 1000              ||
value       || average | max     || average | max     ||
------------||---------|---------||---------|---------||
no-sparse   || 307     | 1456    || 279     | 1000    ||
no-snapshot || 312     | 1456    || 283     | 1000    ||
snapshot    || 248     | 1208    || 241     | 1000    ||

Full Store Size

limit       | none        | 1000
------------|-------------|------------
no-sparse   | 51 013 198  | 51 201 574
no-snapshot | 50 930 795  | 51 141 006
snapshot    | 48 072 037  | 48 093 572

pypy

Manifest Size:

limit       | none        | 1000
------------|-------------|------------
no-sparse   | 193 987 784 | 193 987 784
no-snapshot | 163 171 745 | 163 312 229
snapshot    | 34 605 900  | 34 600 750

Manifest Chain length data

limit       || none              || 1000              ||
value       || average | max     || average | max     ||
------------||---------|---------||---------|---------||
no-sparse   || 101     | 692     || 101     | 692     ||
no-snapshot || 151     | 1307    || 148     | 1000    ||
snapshot    || 128     | 1309    || 125     | 1000    ||

Full Store Size

limit       | none        | 1000
------------|-------------|------------
no-sparse   | 495 931 473 | 495 931 473
no-snapshot | 465 441 017 | 465 581 501
snapshot    | 355 467 301 | 355 472 451

Mozilla

Manifest Size:

limit       | none           | 1000
------------|----------------|---------------
no-sparse   | 416 757 148    | 1 869 009 668
no-snapshot | 401 592 370    | 1 843 493 795
snapshot    | 224 359 521    | 284 615 500

Manifest Chain length data

limit       || none              || 1000              ||
value       || average | max     || average | max     ||
------------||---------|---------||---------|---------||
no-sparse   || 15 333  | 58 980  || 468     | 1 000   ||
no-snapshot || 15 336  | 58 980  || 469     | 1 000   ||
snapshot    || 15 338  | 58 983  || 469     | 1 000   ||

Full Store Size

limit       | none           | 1000
------------|----------------|---------------
no-sparse   | 2 712 477 887  | 4 164 995 451
no-snapshot | 2 698 887 835  | 4 141 054 304
snapshot    | 2 518 130 385  | 2 578 587 596

Netbeans

Manifest Size:

limit       | none           | 1000
------------|----------------|---------------
no-sparse   | 4 766 794 101  | 4 870 642 687
no-snapshot | 4 334 806 082  | 4 428 681 309
snapshot    | 232 659 666    | 240 330 665

Manifest Chain length data

limit       || none              || 1000              ||
value       || average | max     || average | max     ||
------------||---------|---------||---------|---------||
no-sparse   || 597     | 6 802   || 254     | 1 000   ||
no-snapshot || 648     | 6 802   || 305     | 1 000   ||
snapshot    || 613     | 6 804   || 282     | 1 000   ||

Full Store Size

limit       | none           | 1000
------------|----------------|---------------
no-sparse   | 5 807 347 998  | 5 911 196 584
no-snapshot | 5 375 398 602  | 5 469 273 829
snapshot    | 1 282 519 928  | 1 290 190 927

Private repo #1

Manifest Size:

limit       | none            | 1000
------------|-----------------|---------------
no-sparse   | 41 389 010 840  | 41 398 162 091
no-snapshot | 9 737 319 435   | 10 223 773 150
snapshot    | 744 215 807     | 747 961 822

Manifest Chain length data

limit       || none              || 1000              ||
value       || average | max     || average | max     ||
------------||---------|---------||---------|---------||
no-sparse   || 245     | 8 885   || 81      | 1 000   ||
no-snapshot || 1 225   | 8 885   || 336     | 1 000   ||
snapshot    || 1 068   | 7 909   || 307     | 1 000   ||

Full Store Size

limit       | none            | 1000
------------|-----------------|---------------
no-sparse   | 49 646 065 126  | 49 655 216 377
no-snapshot | 17 924 862 856  | 18 411 316 571
snapshot    | 9 009 024 710   | 9 012 770 725

Private repo #2

We currently have less data available for that repository.

* Before is a sparse-revlog repository without this series
* After is a sparse-revlog repository with this series + 1000 chain limit

Manifest Size:
    Before: 1 531 485 040 bytes
    After:  1 091 422 451 bytes

Manifest Chain:
    Before: 2 218 avg; 6 575 Max
    After:    442 avg; 1 000 Max

Full Store Size
    Before: 15 203 955 615
    After:   8 207 180 693
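
The percentage increases quoted for the test repository can be re-derived from the byte counts in the comparison table. The snippet below is only an illustrative sketch: the names `SIZES` and `increase_pct` are hypothetical helpers and are not part of Mercurial or of this changeset.

```python
# Store sizes in bytes, copied from the test-repository comparison table.
# SIZES and increase_pct are hypothetical names used for illustration only.
SIZES = {
    "without snapshot": {
        "none": 74365490, "limit=15": 108211410, "limit=10": 133857764,
    },
    "with snapshot": {
        "none": 59230936, "limit=15": 63002924, "limit=10": 68415329,
    },
}


def increase_pct(base, limited):
    """Return the relative growth of `limited` over `base`, in percent."""
    return (limited - base) * 100.0 / base


for name in sorted(SIZES):
    row = SIZES[name]
    for limit in ("limit=15", "limit=10"):
        pct = increase_pct(row["none"], row[limit])
        print("%s, %s: +%.1f%%" % (name, limit, pct))
```

The computed values land close to the rounded figures in the description: roughly +6% and +15% with snapshots, and roughly +45% and +80% without them.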
author Boris Feld <boris.feld@octobus.net>
date Fri, 07 Sep 2018 11:18:45 -0400
parents 71f189941791
children 566daffc607d

# changelog bisection for mercurial
#
# Copyright 2007 Matt Mackall
# Copyright 2005, 2006 Benoit Boissinot <benoit.boissinot@ens-lyon.org>
#
# Inspired by git bisect, extension skeleton taken from mq.py.
#
# This software may be used and distributed according to the terms of the
# GNU General Public License version 2 or any later version.

from __future__ import absolute_import

import collections

from .i18n import _
from .node import (
    hex,
    short,
)
from . import (
    error,
)

def bisect(repo, state):
    """find the next node (if any) for testing during a bisect search.
    returns a (nodes, number, good) tuple.

    'nodes' is the final result of the bisect if 'number' is 0.
    Otherwise 'number' indicates the remaining possible candidates for
    the search and 'nodes' contains the next bisect target.
    'good' is True if bisect is searching for a first good changeset, False
    if searching for a first bad one.
    """

    changelog = repo.changelog
    clparents = changelog.parentrevs
    skip = set([changelog.rev(n) for n in state['skip']])

    def buildancestors(bad, good):
        badrev = min([changelog.rev(n) for n in bad])
        ancestors = collections.defaultdict(lambda: None)
        for rev in repo.revs("descendants(%ln) - ancestors(%ln)", good, good):
            ancestors[rev] = []
        if ancestors[badrev] is None:
            return badrev, None
        return badrev, ancestors

    good = False
    badrev, ancestors = buildancestors(state['bad'], state['good'])
    if not ancestors: # looking for bad to good transition?
        good = True
        badrev, ancestors = buildancestors(state['good'], state['bad'])
    bad = changelog.node(badrev)
    if not ancestors: # now we're confused
        if (len(state['bad']) == 1 and len(state['good']) == 1 and
            state['bad'] != state['good']):
            raise error.Abort(_("starting revisions are not directly related"))
        raise error.Abort(_("inconsistent state, %d:%s is good and bad")
                         % (badrev, short(bad)))

    # build children dict
    children = {}
    visit = collections.deque([badrev])
    candidates = []
    while visit:
        rev = visit.popleft()
        if ancestors[rev] == []:
            candidates.append(rev)
            for prev in clparents(rev):
                if prev != -1:
                    if prev in children:
                        children[prev].append(rev)
                    else:
                        children[prev] = [rev]
                        visit.append(prev)

    candidates.sort()
    # have we narrowed it down to one entry?
    # or have all other possible candidates besides 'bad' been skipped?
    tot = len(candidates)
    unskipped = [c for c in candidates if (c not in skip) and (c != badrev)]
    if tot == 1 or not unskipped:
        return ([changelog.node(c) for c in candidates], 0, good)
    perfect = tot // 2

    # find the best node to test
    best_rev = None
    best_len = -1
    poison = set()
    for rev in candidates:
        if rev in poison:
            # poison children
            poison.update(children.get(rev, []))
            continue

        a = ancestors[rev] or [rev]
        ancestors[rev] = None

        x = len(a) # number of ancestors
        y = tot - x # number of non-ancestors
        value = min(x, y) # how good is this test?
        if value > best_len and rev not in skip:
            best_len = value
            best_rev = rev
            if value == perfect: # found a perfect candidate? quit early
                break

        if y < perfect and rev not in skip: # all downhill from here?
            # poison children
            poison.update(children.get(rev, []))
            continue

        for c in children.get(rev, []):
            if ancestors[c]:
                ancestors[c] = list(set(ancestors[c] + a))
            else:
                ancestors[c] = a + [c]

    assert best_rev is not None
    best_node = changelog.node(best_rev)

    return ([best_node], tot, good)

def extendrange(repo, state, nodes, good):
    # bisect is incomplete when it ends on a merge node and
    # one of the parents was not checked.
    parents = repo[nodes[0]].parents()
    if len(parents) > 1:
        if good:
            side = state['bad']
        else:
            side = state['good']
        num = len(set(i.node() for i in parents) & set(side))
        if num == 1:
            return parents[0].ancestor(parents[1])
    return None

def load_state(repo):
    """read the bisect state from the 'bisect.state' file, if present"""
    state = {'current': [], 'good': [], 'bad': [], 'skip': []}
    for l in repo.vfs.tryreadlines("bisect.state"):
        kind, node = l[:-1].split()
        node = repo.lookup(node)
        if kind not in state:
            raise error.Abort(_("unknown bisect kind %s") % kind)
        state[kind].append(node)
    return state


def save_state(repo, state):
    """write the bisect state to the 'bisect.state' file"""
    f = repo.vfs("bisect.state", "w", atomictemp=True)
    with repo.wlock():
        for kind in sorted(state):
            for node in state[kind]:
                f.write("%s %s\n" % (kind, hex(node)))
        f.close()

def resetstate(repo):
    """remove any bisect state from the repository"""
    if repo.vfs.exists("bisect.state"):
        repo.vfs.unlink("bisect.state")

def checkstate(state):
    """check we have both 'good' and 'bad' to define a range

    Raise Abort exception otherwise."""
    if state['good'] and state['bad']:
        return True
    if not state['good']:
        raise error.Abort(_('cannot bisect (no known good revisions)'))
    else:
        raise error.Abort(_('cannot bisect (no known bad revisions)'))

def get(repo, status):
    """
    Return a list of revision(s) that match the given status:

    - ``good``, ``bad``, ``skip``: csets explicitly marked as good/bad/skip
    - ``goods``, ``bads``      : csets topologically good/bad
    - ``range``              : csets taking part in the bisection
    - ``pruned``             : csets that are goods, bads or skipped
    - ``untested``           : csets whose fate is yet unknown
    - ``ignored``            : csets ignored due to DAG topology
    - ``current``            : the cset currently being bisected
    """
    state = load_state(repo)
    if status in ('good', 'bad', 'skip', 'current'):
        return map(repo.changelog.rev, state[status])
    else:
        # In the following sets, we do *not* call 'bisect()' with more
        # than one level of recursion, because that can be very, very
        # time consuming. Instead, we always develop the expression as
        # much as possible.

        # 'range' is all csets that make the bisection:
        #   - have a good ancestor and a bad descendant, or conversely
        # that's because the bisection can go either way
        range = '( bisect(bad)::bisect(good) | bisect(good)::bisect(bad) )'

        _t = repo.revs('bisect(good)::bisect(bad)')
        # The sets of topologically good or bad csets
        if len(_t) == 0:
            # Goods are topologically after bads
            goods = 'bisect(good)::'    # Pruned good csets
            bads  = '::bisect(bad)'     # Pruned bad csets
        else:
            # Goods are topologically before bads
            goods = '::bisect(good)'    # Pruned good csets
            bads  = 'bisect(bad)::'     # Pruned bad csets

        # 'pruned' is all csets whose fate is already known: good, bad, skip
        skips = 'bisect(skip)'                 # Pruned skipped csets
        pruned = '( (%s) | (%s) | (%s) )' % (goods, bads, skips)

        # 'untested' is all csets that are in 'range', but not in 'pruned'
        untested = '( (%s) - (%s) )' % (range, pruned)

        # 'ignored' is all csets that were not used during the bisection
        # due to DAG topology, but may nevertheless have had an impact.
        # E.g., a branch merged between bads and goods, but whose
        # branch point is outside of the range.
        iba = '::bisect(bad) - ::bisect(good)'  # Ignored bads' ancestors
        iga = '::bisect(good) - ::bisect(bad)'  # Ignored goods' ancestors
        ignored = '( ( (%s) | (%s) ) - (%s) )' % (iba, iga, range)

        if status == 'range':
            return repo.revs(range)
        elif status == 'pruned':
            return repo.revs(pruned)
        elif status == 'untested':
            return repo.revs(untested)
        elif status == 'ignored':
            return repo.revs(ignored)
        elif status == "goods":
            return repo.revs(goods)
        elif status == "bads":
            return repo.revs(bads)
        else:
            raise error.ParseError(_('invalid bisect state'))

def label(repo, node):
    """return the translated bisect status of a node, or None if it has none"""
    rev = repo.changelog.rev(node)

    # Try explicit sets
    if rev in get(repo, 'good'):
        # i18n: bisect changeset status
        return _('good')
    if rev in get(repo, 'bad'):
        # i18n: bisect changeset status
        return _('bad')
    if rev in get(repo, 'skip'):
        # i18n: bisect changeset status
        return _('skipped')
    if rev in get(repo, 'untested') or rev in get(repo, 'current'):
        # i18n: bisect changeset status
        return _('untested')
    if rev in get(repo, 'ignored'):
        # i18n: bisect changeset status
        return _('ignored')

    # Try implicit sets
    if rev in get(repo, 'goods'):
        # i18n: bisect changeset status
        return _('good (implicit)')
    if rev in get(repo, 'bads'):
        # i18n: bisect changeset status
        return _('bad (implicit)')

    return None

def printresult(ui, repo, state, displayer, nodes, good):
    if len(nodes) == 1:
        # narrowed it down to a single revision
        if good:
            ui.write(_("The first good revision is:\n"))
        else:
            ui.write(_("The first bad revision is:\n"))
        displayer.show(repo[nodes[0]])
        extendnode = extendrange(repo, state, nodes, good)
        if extendnode is not None:
            ui.write(_('Not all ancestors of this changeset have been'
                       ' checked.\nUse bisect --extend to continue the '
                       'bisection from\nthe common ancestor, %s.\n')
                     % extendnode)
    else:
        # multiple possible revisions
        if good:
            ui.write(_("Due to skipped revisions, the first "
                    "good revision could be any of:\n"))
        else:
            ui.write(_("Due to skipped revisions, the first "
                    "bad revision could be any of:\n"))
        for n in nodes:
            displayer.show(repo[n])
    displayer.close()