changeset 31210:e1d035905b2e
similar: compare between actual file contents for exact identity
Before this patch, the similarity detection logic (for addremove and
automv) relied entirely on SHA-1 digests. This causes incorrect
rename detection if:
- file A is removed and file B is added in the same commit, and
- files A and B have the same SHA-1 hash but different contents
For example, this may prevent security experts from managing sample
files for the SHAttered issue in a Mercurial repository.
https://security.googleblog.com/2017/02/announcing-first-sha1-collision.html
https://shattered.it/
A hash collision itself isn't so serious for the core repository
functionality of Mercurial, though, as mpm describes at the link below.
https://www.mercurial-scm.org/wiki/mpm/SHA1
This patch compares the actual file contents after the hash comparison
to confirm exact identity.
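The hash-then-compare approach can be sketched in isolation as follows.
This is a minimal illustration, not the code in mercurial/similar.py:
the function name find_exact_matches, the dict-based inputs, and the
digest parameter are all assumptions made for the example.

```python
import hashlib


def _sha1(data):
    """Default digest: SHA-1 of the raw bytes."""
    return hashlib.sha1(data).digest()


def find_exact_matches(removed, added, digest=_sha1):
    """Yield (removed_name, added_name) pairs with byte-identical contents.

    `removed` and `added` are hypothetical dicts mapping file name -> bytes.
    `digest` is injectable only so that a collision can be simulated.
    """
    # Index removed files by digest; on duplicates the last one wins.
    hashes = {}
    for rname, rdata in removed.items():
        hashes[digest(rdata)] = rname

    for aname, adata in added.items():
        rname = hashes.get(digest(adata))
        # A digest hit is only a candidate: confirm with a full comparison.
        if rname is not None and adata == removed[rname]:
            yield (rname, aname)
```

With a deliberately weak digest such as len, two different files of the
same length collide, and the final comparison is what rejects the pair.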
Even after this patch, SHA-1 is still used, because it is good
enough to quickly detect the existence of an "(almost) same" file:
- replacing SHA-1 would decrease performance, and
- it is not yet clear what should replace it
Getting the content of the removed file (= rfctx.data()) at each exact
comparison should be cheap enough, even though getting the content of
the added one is costly.
======= ============== =====================
file    fctx           data() reads from
======= ============== =====================
removed filectx        in-memory revlog data
added   workingfilectx storage
======= ============== =====================
author | FUJIWARA Katsunori <foozy@lares.dti.ne.jp> |
---|---|
date | Fri, 03 Mar 2017 02:57:06 +0900 |
parents | dd2364f5180a |
children | ecbd378d9a7e |
files | mercurial/similar.py |
diffstat | 1 files changed, 6 insertions(+), 2 deletions(-) |
--- a/mercurial/similar.py Thu Mar 02 21:49:30 2017 -0800
+++ b/mercurial/similar.py Fri Mar 03 02:57:06 2017 +0900
@@ -35,9 +35,13 @@
     for i, fctx in enumerate(added):
         repo.ui.progress(_('searching for exact renames'), i + len(removed),
                 total=numfiles, unit=_('files'))
-        h = hashlib.sha1(fctx.data()).digest()
+        adata = fctx.data()
+        h = hashlib.sha1(adata).digest()
         if h in hashes:
-            yield (hashes[h], fctx)
+            rfctx = hashes[h]
+            # compare between actual file contents for exact identity
+            if adata == rfctx.data():
+                yield (rfctx, fctx)
 
     # Done
     repo.ui.progress(_('searching for exact renames'), None)