view mercurial/similar.py @ 32697:19b9fc40cc51

revlog: skeleton support for version 2 revlogs There are a number of improvements we want to make to revlogs that will require a new version - version 2. It is unclear what the full set of improvements will be or when we'll be done with them. What I do know is that the process will likely take longer than a single release, will require input from various stakeholders to evaluate changes, and will have many contentious debates and bikeshedding. It is unrealistic to develop revlog version 2 up front: there are just too many uncertainties that we won't know until things are implemented and experiments are run. Some changes will also be invasive and prone to bit rot, so sitting on dozens of patches is not practical. This commit introduces skeleton support for version 2 revlogs in a way that is flexible and not bound by backwards compatibility concerns. An experimental repo requirement for denoting revlog v2 has been added. The requirement string has a sub-version component to it. This will allow us to declare multiple requirements in the course of developing revlog v2. Whenever we change the in-development revlog v2 format, we can tweak the string, creating a new requirement and locking out old clients. This will allow us to make as many backwards incompatible changes and experiments to revlog v2 as we want. In other words, we can land code and make meaningful progress towards revlog v2 while still maintaining extreme format flexibility up until the point we freeze the format and remove the experimental labels. To enable the new repo requirement, you must supply an experimental and undocumented config option. But not just any boolean flag will do: you need to explicitly use a value that no sane person should ever type. This is an additional guard against enabling revlog v2 on an installation it shouldn't be enabled on. The specific scenario I'm trying to prevent is say a user with a 4.4 client with a frozen format enabling the option but then downgrading to 4.3 and accidentally creating repos with an outdated and unsupported repo format. Requiring a "challenge" string should prevent this. Because the format is not yet finalized and I don't want to take any chances, revlog v2's version is currently 0xDEAD. I figure squatting on a value we're likely never to use as an actual revlog version to mean "internal testing only" is acceptable. And "dead" is easily recognized as something meaningful. There is a bunch of cleanup that is needed before work on revlog v2 begins in earnest. I plan on doing that work once this patch is accepted and we're comfortable with the idea of starting down this path.
author Gregory Szorc <gregory.szorc@gmail.com>
date Fri, 19 May 2017 20:29:11 -0700
parents ded48ad55146
children cd196be26cb7
line wrap: on
line source

# similar.py - mechanisms for finding similar files
#
# Copyright 2005-2007 Matt Mackall <mpm@selenic.com>
#
# This software may be used and distributed according to the terms of the
# GNU General Public License version 2 or any later version.

from __future__ import absolute_import

from .i18n import _
from . import (
    mdiff,
)

def _findexactmatches(repo, added, removed):
    '''find renamed files that have no changes

    Takes a list of new filectxs and a list of removed filectxs, and yields
    (before, after) tuples of exact matches.
    '''
    numfiles = len(added) + len(removed)

    # Build table of removed files: {hash(fctx.data()): [fctx, ...]}.
    # We use hash() to discard fctx.data() from memory.
    hashes = {}
    for i, fctx in enumerate(removed):
        repo.ui.progress(_('searching for exact renames'), i, total=numfiles,
                         unit=_('files'))
        h = hash(fctx.data())
        if h not in hashes:
            hashes[h] = [fctx]
        else:
            hashes[h].append(fctx)

    # For each added file, see if it corresponds to a removed file.
    for i, fctx in enumerate(added):
        repo.ui.progress(_('searching for exact renames'), i + len(removed),
                total=numfiles, unit=_('files'))
        adata = fctx.data()
        h = hash(adata)
        for rfctx in hashes.get(h, []):
            # compare between actual file contents for exact identity
            if adata == rfctx.data():
                yield (rfctx, fctx)
                break

    # Done
    repo.ui.progress(_('searching for exact renames'), None)

def _ctxdata(fctx):
    # lazily load text
    orig = fctx.data()
    return orig, mdiff.splitnewlines(orig)

def _score(fctx, otherdata):
    orig, lines = otherdata
    text = fctx.data()
    # mdiff.blocks() returns blocks of matching lines
    # count the number of bytes in each
    equal = 0
    matches = mdiff.blocks(text, orig)
    for x1, x2, y1, y2 in matches:
        for line in lines[y1:y2]:
            equal += len(line)

    lengths = len(text) + len(orig)
    return equal * 2.0 / lengths

def score(fctx1, fctx2):
    return _score(fctx1, _ctxdata(fctx2))

def _findsimilarmatches(repo, added, removed, threshold):
    '''find potentially renamed files based on similar file content

    Takes a list of new filectxs and a list of removed filectxs, and yields
    (before, after, score) tuples of partial matches.
    '''
    copies = {}
    for i, r in enumerate(removed):
        repo.ui.progress(_('searching for similar files'), i,
                         total=len(removed), unit=_('files'))

        data = None
        for a in added:
            bestscore = copies.get(a, (None, threshold))[1]
            if data is None:
                data = _ctxdata(r)
            myscore = _score(a, data)
            if myscore > bestscore:
                copies[a] = (r, myscore)
    repo.ui.progress(_('searching'), None)

    for dest, v in copies.iteritems():
        source, bscore = v
        yield source, dest, bscore

def _dropempty(fctxs):
    return [x for x in fctxs if x.size() > 0]

def findrenames(repo, added, removed, threshold):
    '''find renamed files -- yields (before, after, score) tuples'''
    wctx = repo[None]
    pctx = wctx.p1()

    # Zero length files will be frequently unrelated to each other, and
    # tracking the deletion/addition of such a file will probably cause more
    # harm than good. We strip them out here to avoid matching them later on.
    addedfiles = _dropempty(wctx[fp] for fp in sorted(added))
    removedfiles = _dropempty(pctx[fp] for fp in sorted(removed) if fp in pctx)

    # Find exact matches.
    matchedfiles = set()
    for (a, b) in _findexactmatches(repo, addedfiles, removedfiles):
        matchedfiles.add(b)
        yield (a.path(), b.path(), 1.0)

    # If the user requested similar files to be matched, search for them also.
    if threshold < 1.0:
        addedfiles = [x for x in addedfiles if x not in matchedfiles]
        for (a, b, score) in _findsimilarmatches(repo, addedfiles,
                                                 removedfiles, threshold):
            yield (a.path(), b.path(), score)