comparison mercurial/copies.py @ 42222:57203e0210f8
copies: calculate mergecopies() based on pathcopies()
When copies are stored in changesets, we need a changeset-centric
version of mergecopies() just like we have a changeset-centric version
of pathcopies(). I think the natural way of thinking about
mergecopies() is in terms of pathcopies() from the base to each of the
commits. So if we can rewrite mergecopies() based on two such
pathcopies() calls, we'll get the changeset-centric version for
free. That's what this patch does.
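The core idea above can be sketched in plain Python. Here the two pathcopies() results are modeled as simple {dst: src} dicts and the helper names are made up for illustration; the real code additionally consults the manifests to separate renames from copies and to detect the rename-and-delete case:

```python
# Sketch of deriving merge copies from two pathcopies() results.
# copies1 = pathcopies(base, c1), copies2 = pathcopies(base, c2),
# each modeled as a {dst: src} dict.

def invert(copies):
    """Turn a {dst: src} map into a {src: [dst, ...]} map."""
    inverse = {}
    for dst, src in copies.items():
        inverse.setdefault(src, []).append(dst)
    return inverse

def mergecopies_sketch(copies1, copies2):
    """Combine the per-side copy maps into 'copy' and 'diverge' dicts."""
    inv1, inv2 = invert(copies1), invert(copies2)
    copy, diverge = {}, {}
    for src in set(inv1) | set(inv2):
        dsts1 = set(inv1.get(src, ()))
        dsts2 = set(inv2.get(src, ()))
        if dsts1 and dsts2:
            # Copied on both sides: overlapping destinations merge
            # cleanly; disjoint destinations are divergent renames.
            overlap = dsts1 & dsts2
            if overlap:
                for dst in overlap:
                    copy[dst] = src
            else:
                diverge[src] = sorted(dsts1 | dsts2)
        else:
            # Copied/renamed on one side only.
            for dst in dsts1 or dsts2:
                copy[dst] = src
    return copy, diverge

# Side 1 renamed a->b while side 2 renamed a->c: divergent.
copy, diverge = mergecopies_sketch({'b': 'a'}, {'c': 'a'})
```

This is only the combination step; everything hard (tracing the copies themselves, including backwards and ping-pong renames) is delegated to pathcopies().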
A nice bonus is that it ends up being a lot simpler. mergecopies() has
accumulated a lot of technical debt over time. One good example is the
code for dealing with grafts (the "partial/incomplete/dirty"
stuff). Since pathcopies() already deals with backwards renames and
ping-pong renames, we get that for free.
I've run tests with hard-coded debug logging for "fullcopy" and while
I haven't looked at every difference it produces, all the ones I have
looked at seemed reasonable to me. I'm a little surprised that no more
tests fail when run with '--extra-config-opt
experimental.copies.read-from=compatibility' compared to before this
patch. This patch also fixes the broken cases in test-annotate.t and
test-fastannotate.t. It also enables the part of test-copies.t that
was previously disabled exactly because mergecopies() needed to get a
changeset-centric version.
One drawback of the rewritten code is that we may now make
remotefilelog prefetch more files. We used to prefetch files that were
unique to either side of the merge compared to the other. We now
prefetch files that are unique to either side of the merge compared to
the base. This means that if you added the same file to each side, we
would not prefetch it before, but we would now. Such cases are
probably quite rare, but one likely scenario where they happen is when
moving from a commit to its successor (or the other way around). The
user will probably already have the files in the cache in such cases,
so it's probably not a big deal.
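The prefetch change described above boils down to which set difference is taken; a minimal illustration with hypothetical file sets (not remotefilelog's actual API):

```python
# Files present on each side of the merge and in the base.
side1 = {'a.txt', 'b.txt', 'new.txt'}
side2 = {'a.txt', 'c.txt', 'new.txt'}
base = {'a.txt', 'b.txt', 'c.txt'}

# Old behavior: prefetch files unique to either side vs. the other side.
old_prefetch = (side1 - side2) | (side2 - side1)

# New behavior: prefetch files unique to either side vs. the base, so a
# file added identically on both sides ('new.txt') is now included.
new_prefetch = (side1 - base) | (side2 - base)
```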
Some timings for calculating mergecopies between two revisions
(revisions shown on each line, all using the common ancestor as base):
In the hg repo:
4.8 4.9: 0.21s -> 0.21s
4.0 4.8: 0.35s -> 0.63s
In an old copy of the mozilla-unified repo:
FIREFOX_BETA_60_BASE^ FIREFOX_BETA_60_BASE: 0.82s -> 0.82s
FIREFOX_NIGHTLY_59_END FIREFOX_BETA_60_BASE: 2.5s -> 2.6s
FIREFOX_BETA_59_END FIREFOX_BETA_60_BASE: 3.9s -> 4.1s
FIREFOX_AURORA_50_BASE FIREFOX_BETA_60_BASE: 31s -> 33s
So it's measurably slower in most cases. The most significant
difference is in the hg repo between revisions 4.0 and 4.8. In that
case it seems to come from the fact that pathcopies() uses
fctx.isintroducedafter() (in _tracefile), while the old mergecopies()
used fctx.linkrev() (in _checkcopies()). That results in a single call
to filectx._adjustlinkrev(), which is responsible for the entire
difference in time (in my repo). So we pay a performance penalty but
we get more correct code (see change in
test-mv-cp-st-diff.t). Deleting the "== f.filenode()" in _tracefile()
recovers the lost performance in the hg repo.
There were a few other optimizations in _checkcopies() that I could
not measure any impact from. One was from the "seen" set. Another was
from a "continue" when the file was not in the destination manifest
(corresponding to "am" in _tracefile).
Also note that merge copies are not calculated when updating with a
clean working copy, which is probably the most common case. I
therefore think the much simpler code is worth the slowdown.
Differential Revision: https://phab.mercurial-scm.org/D6255
author | Martin von Zweigbergk <martinvonz@google.com> |
---|---|
date | Thu, 11 Apr 2019 23:22:54 -0700 |
parents | 390ec72b8ea4 |
children | d69bc8ffbe6f |
42221:c83c08cf02b7 | 42222:57203e0210f8 |
---|---|
371 if u2: | 371 if u2: |
372 repo.ui.debug("%s:\n %s\n" % (header % 'other', "\n ".join(u2))) | 372 repo.ui.debug("%s:\n %s\n" % (header % 'other', "\n ".join(u2))) |
373 | 373 |
374 return u1, u2 | 374 return u1, u2 |
375 | 375 |
376 def _makegetfctx(ctx): | |
377 """return a 'getfctx' function suitable for _checkcopies usage | |
378 | |
379 We have to re-setup the function building 'filectx' for each | |
380 '_checkcopies' to ensure the linkrev adjustment is properly setup for | |
381 each. Linkrev adjustment is important to avoid bug in rename | |
382 detection. Moreover, having a proper '_ancestrycontext' setup ensures | |
383 the performance impact of this adjustment is kept limited. Without it, | |
384 each file could do a full dag traversal making the time complexity of | |
385 the operation explode (see issue4537). | |
386 | |
387 This function exists here mostly to limit the impact on stable. Feel | |
388 free to refactor on default. | |
389 """ | |
390 rev = ctx.rev() | |
391 repo = ctx._repo | |
392 ac = getattr(ctx, '_ancestrycontext', None) | |
393 if ac is None: | |
394 revs = [rev] | |
395 if rev is None: | |
396 revs = [p.rev() for p in ctx.parents()] | |
397 ac = repo.changelog.ancestors(revs, inclusive=True) | |
398 ctx._ancestrycontext = ac | |
399 def makectx(f, n): | |
400 if n in node.wdirfilenodeids: # in a working context? | |
401 if ctx.rev() is None: | |
402 return ctx.filectx(f) | |
403 return repo[None][f] | |
404 fctx = repo.filectx(f, fileid=n) | |
405 # setup only needed for filectx not create from a changectx | |
406 fctx._ancestrycontext = ac | |
407 fctx._descendantrev = rev | |
408 return fctx | |
409 return util.lrucachefunc(makectx) | |
410 | |
411 def _combinecopies(copyfrom, copyto, finalcopy, diverge, incompletediverge): | |
412 """combine partial copy paths""" | |
413 remainder = {} | |
414 for f in copyfrom: | |
415 if f in copyto: | |
416 finalcopy[copyto[f]] = copyfrom[f] | |
417 del copyto[f] | |
418 for f in incompletediverge: | |
419 assert f not in diverge | |
420 ic = incompletediverge[f] | |
421 if ic[0] in copyto: | |
422 diverge[f] = [copyto[ic[0]], ic[1]] | |
423 else: | |
424 remainder[f] = ic | |
425 return remainder | |
426 | |
427 def mergecopies(repo, c1, c2, base): | 376 def mergecopies(repo, c1, c2, base): |
428 """ | 377 """ |
429 Finds moves and copies between context c1 and c2 that are relevant for | 378 Finds moves and copies between context c1 and c2 that are relevant for |
430 merging. 'base' will be used as the merge base. | 379 merging. 'base' will be used as the merge base. |
431 | 380 |
516 'copytrace.sourcecommitlimit') | 465 'copytrace.sourcecommitlimit') |
517 commits = len(repo.revs('%d::%d', base.rev(), c1.rev())) | 466 commits = len(repo.revs('%d::%d', base.rev(), c1.rev())) |
518 return commits < sourcecommitlimit | 467 return commits < sourcecommitlimit |
519 return False | 468 return False |
520 | 469 |
470 def _checksinglesidecopies(src, dsts1, m1, m2, mb, c2, base, | |
471 copy, renamedelete): | |
472 if src not in m2: | |
473 # deleted on side 2 | |
474 if src not in m1: | |
475 # renamed on side 1, deleted on side 2 | |
476 renamedelete[src] = dsts1 | |
477 elif m2[src] != mb[src]: | |
478 if not _related(c2[src], base[src]): | |
479 return | |
480 # modified on side 2 | |
481 for dst in dsts1: | |
482 if dst not in m2: | |
483 # dst not added on side 2 (handle as regular | |
484 # "both created" case in manifestmerge otherwise) | |
485 copy[dst] = src | |
486 | |
521 def _fullcopytracing(repo, c1, c2, base): | 487 def _fullcopytracing(repo, c1, c2, base): |
522 """ The full copytracing algorithm which finds all the new files that were | 488 """ The full copytracing algorithm which finds all the new files that were |
523 added from merge base up to the top commit and for each file it checks if | 489 added from merge base up to the top commit and for each file it checks if |
524 this file was copied from another file. | 490 this file was copied from another file. |
525 | 491 |
526 This is pretty slow when a lot of changesets are involved but will track all | 492 This is pretty slow when a lot of changesets are involved but will track all |
527 the copies. | 493 the copies. |
528 """ | 494 """ |
529 # In certain scenarios (e.g. graft, update or rebase), base can be | |
530 # overridden We still need to know a real common ancestor in this case We | |
531 # can't just compute _c1.ancestor(_c2) and compare it to ca, because there | |
532 # can be multiple common ancestors, e.g. in case of bidmerge. Because our | |
533 # caller may not know if the revision passed in lieu of the CA is a genuine | |
534 # common ancestor or not without explicitly checking it, it's better to | |
535 # determine that here. | |
536 # | |
537 # base.isancestorof(wc) is False, work around that | |
538 _c1 = c1.p1() if c1.rev() is None else c1 | |
539 _c2 = c2.p1() if c2.rev() is None else c2 | |
540 # an endpoint is "dirty" if it isn't a descendant of the merge base | |
541 # if we have a dirty endpoint, we need to trigger graft logic, and also | |
542 # keep track of which endpoint is dirty | |
543 dirtyc1 = not base.isancestorof(_c1) | |
544 dirtyc2 = not base.isancestorof(_c2) | |
545 graft = dirtyc1 or dirtyc2 | |
546 tca = base | |
547 if graft: | |
548 tca = _c1.ancestor(_c2) | |
549 | |
550 limit = _findlimit(repo, c1, c2) | |
551 | |
552 m1 = c1.manifest() | 495 m1 = c1.manifest() |
553 m2 = c2.manifest() | 496 m2 = c2.manifest() |
554 mb = base.manifest() | 497 mb = base.manifest() |
555 | 498 |
556 # gather data from _checkcopies: | 499 copies1 = pathcopies(base, c1) |
557 # - diverge = record all diverges in this dict | 500 copies2 = pathcopies(base, c2) |
558 # - copy = record all non-divergent copies in this dict | 501 |
559 # - fullcopy = record all copies in this dict | 502 inversecopies1 = {} |
560 # - incomplete = record non-divergent partial copies here | 503 inversecopies2 = {} |
561 # - incompletediverge = record divergent partial copies here | 504 for dst, src in copies1.items(): |
562 diverge = {} # divergence data is shared | 505 inversecopies1.setdefault(src, []).append(dst) |
563 incompletediverge = {} | 506 for dst, src in copies2.items(): |
564 data1 = {'copy': {}, | 507 inversecopies2.setdefault(src, []).append(dst) |
565 'fullcopy': {}, | 508 |
566 'incomplete': {}, | 509 copy = {} |
567 'diverge': diverge, | 510 diverge = {} |
568 'incompletediverge': incompletediverge, | 511 renamedelete = {} |
569 } | 512 allsources = set(inversecopies1) | set(inversecopies2) |
570 data2 = {'copy': {}, | 513 for src in allsources: |
571 'fullcopy': {}, | 514 dsts1 = inversecopies1.get(src) |
572 'incomplete': {}, | 515 dsts2 = inversecopies2.get(src) |
573 'diverge': diverge, | 516 if dsts1 and dsts2: |
574 'incompletediverge': incompletediverge, | 517 # copied/renamed on both sides |
575 } | 518 if src not in m1 and src not in m2: |
519 # renamed on both sides | |
520 dsts1 = set(dsts1) | |
521 dsts2 = set(dsts2) | |
522 # If there's some overlap in the rename destinations, we | |
523 # consider it not divergent. For example, if side 1 copies 'a' | |
524 # to 'b' and 'c' and deletes 'a', and side 2 copies 'a' to 'c' | |
525 # and 'd' and deletes 'a'. | |
526 if dsts1 & dsts2: | |
527 for dst in (dsts1 & dsts2): | |
528 copy[dst] = src | |
529 else: | |
530 diverge[src] = sorted(dsts1 | dsts2) | |
531 elif src in m1 and src in m2: | |
532 # copied on both sides | |
533 dsts1 = set(dsts1) | |
534 dsts2 = set(dsts2) | |
535 for dst in (dsts1 & dsts2): | |
536 copy[dst] = src | |
537 # TODO: Handle cases where it was renamed on one side and copied | |
538 # on the other side | |
539 elif dsts1: | |
540 # copied/renamed only on side 1 | |
541 _checksinglesidecopies(src, dsts1, m1, m2, mb, c2, base, | |
542 copy, renamedelete) | |
543 elif dsts2: | |
544 # copied/renamed only on side 2 | |
545 _checksinglesidecopies(src, dsts2, m2, m1, mb, c1, base, | |
546 copy, renamedelete) | |
547 | |
548 renamedeleteset = set() | |
549 divergeset = set() | |
550 for src, dsts in diverge.items(): | |
551 divergeset.update(dsts) | |
552 for src, dsts in renamedelete.items(): | |
553 renamedeleteset.update(dsts) | |
576 | 554 |
577 # find interesting file sets from manifests | 555 # find interesting file sets from manifests |
578 addedinm1 = m1.filesnotin(mb, repo.narrowmatch()) | 556 addedinm1 = m1.filesnotin(mb, repo.narrowmatch()) |
579 addedinm2 = m2.filesnotin(mb, repo.narrowmatch()) | 557 addedinm2 = m2.filesnotin(mb, repo.narrowmatch()) |
580 bothnew = sorted(addedinm1 & addedinm2) | 558 u1, u2 = _computenonoverlap(repo, c1, c2, addedinm1, addedinm2) |
581 if tca == base: | 559 |
582 # unmatched file from base | 560 fullcopy = copies1.copy() |
583 u1r, u2r = _computenonoverlap(repo, c1, c2, addedinm1, addedinm2) | 561 fullcopy.update(copies2) |
584 u1u, u2u = u1r, u2r | |
585 else: | |
586 # unmatched file from base (DAG rotation in the graft case) | |
587 u1r, u2r = _computenonoverlap(repo, c1, c2, addedinm1, addedinm2) | |
588 # unmatched file from topological common ancestors (no DAG rotation) | |
589 # need to recompute this for directory move handling when grafting | |
590 mta = tca.manifest() | |
591 u1u, u2u = _computenonoverlap(repo, c1, c2, | |
592 m1.filesnotin(mta, repo.narrowmatch()), | |
593 m2.filesnotin(mta, repo.narrowmatch()), | |
594 debug=False) | |
595 | |
596 for f in u1u: | |
597 _checkcopies(c1, c2, f, base, tca, dirtyc1, limit, data1) | |
598 | |
599 for f in u2u: | |
600 _checkcopies(c2, c1, f, base, tca, dirtyc2, limit, data2) | |
601 | |
602 copy = dict(data1['copy']) | |
603 copy.update(data2['copy']) | |
604 fullcopy = dict(data1['fullcopy']) | |
605 fullcopy.update(data2['fullcopy']) | |
606 | |
607 if dirtyc1: | |
608 _combinecopies(data2['incomplete'], data1['incomplete'], copy, diverge, | |
609 incompletediverge) | |
610 if dirtyc2: | |
611 _combinecopies(data1['incomplete'], data2['incomplete'], copy, diverge, | |
612 incompletediverge) | |
613 | |
614 renamedelete = {} | |
615 renamedeleteset = set() | |
616 divergeset = set() | |
617 for of, fl in list(diverge.items()): | |
618 if len(fl) == 1 or of in c1 or of in c2: | |
619 del diverge[of] # not actually divergent, or not a rename | |
620 if of not in c1 and of not in c2: | |
621 # renamed on one side, deleted on the other side, but filter | |
622 # out files that have been renamed and then deleted | |
623 renamedelete[of] = [f for f in fl if f in c1 or f in c2] | |
624 renamedeleteset.update(fl) # reverse map for below | |
625 else: | |
626 divergeset.update(fl) # reverse map for below | |
627 | |
628 bothdiverge = {} | |
629 bothincompletediverge = {} | |
630 remainder = {} | |
631 both1 = {'copy': {}, | |
632 'fullcopy': {}, | |
633 'incomplete': {}, | |
634 'diverge': bothdiverge, | |
635 'incompletediverge': bothincompletediverge | |
636 } | |
637 both2 = {'copy': {}, | |
638 'fullcopy': {}, | |
639 'incomplete': {}, | |
640 'diverge': bothdiverge, | |
641 'incompletediverge': bothincompletediverge | |
642 } | |
643 for f in bothnew: | |
644 _checkcopies(c1, c2, f, base, tca, dirtyc1, limit, both1) | |
645 _checkcopies(c2, c1, f, base, tca, dirtyc2, limit, both2) | |
646 if dirtyc1 and dirtyc2: | |
647 remainder = _combinecopies(both2['incomplete'], both1['incomplete'], | |
648 copy, bothdiverge, bothincompletediverge) | |
649 remainder1 = _combinecopies(both1['incomplete'], both2['incomplete'], | |
650 copy, bothdiverge, bothincompletediverge) | |
651 remainder.update(remainder1) | |
652 elif dirtyc1: | |
653 # incomplete copies may only be found on the "dirty" side for bothnew | |
654 assert not both2['incomplete'] | |
655 remainder = _combinecopies({}, both1['incomplete'], copy, bothdiverge, | |
656 bothincompletediverge) | |
657 elif dirtyc2: | |
658 assert not both1['incomplete'] | |
659 remainder = _combinecopies({}, both2['incomplete'], copy, bothdiverge, | |
660 bothincompletediverge) | |
661 else: | |
662 # incomplete copies and divergences can't happen outside grafts | |
663 assert not both1['incomplete'] | |
664 assert not both2['incomplete'] | |
665 assert not bothincompletediverge | |
666 for f in remainder: | |
667 assert f not in bothdiverge | |
668 ic = remainder[f] | |
669 if ic[0] in (m1 if dirtyc1 else m2): | |
670 # backed-out rename on one side, but watch out for deleted files | |
671 bothdiverge[f] = ic | |
672 for of, fl in bothdiverge.items(): | |
673 if len(fl) == 2 and fl[0] == fl[1]: | |
674 copy[fl[0]] = of # not actually divergent, just matching renames | |
675 | |
676 # Sometimes we get invalid copies here (the "and not remotebase" in | |
677 # _checkcopies() seems suspicious). Filter them out. | |
678 for dst, src in fullcopy.copy().items(): | |
679 if src not in mb: | |
680 del fullcopy[dst] | |
681 # Sometimes we forget to add entries from "copy" to "fullcopy", so fix | |
682 # that up here | |
683 for dst, src in copy.items(): | |
684 fullcopy[dst] = src | |
685 # Sometimes we forget to add entries from "diverge" to "fullcopy", so fix | |
686 # that up here | |
687 for src, dsts in diverge.items(): | |
688 for dst in dsts: | |
689 fullcopy[dst] = src | |
690 | |
691 if not fullcopy: | 562 if not fullcopy: |
692 return copy, {}, diverge, renamedelete, {} | 563 return copy, {}, diverge, renamedelete, {} |
693 | 564 |
694 if repo.ui.debugflag: | 565 if repo.ui.debugflag: |
695 repo.ui.debug(" all copies found (* = to merge, ! = divergent, " | 566 repo.ui.debug(" all copies found (* = to merge, ! = divergent, " |
750 repo.ui.debug(" discovered dir src: '%s' -> dst: '%s'\n" % | 621 repo.ui.debug(" discovered dir src: '%s' -> dst: '%s'\n" % |
751 (d, dirmove[d])) | 622 (d, dirmove[d])) |
752 | 623 |
753 movewithdir = {} | 624 movewithdir = {} |
754 # check unaccounted nonoverlapping files against directory moves | 625 # check unaccounted nonoverlapping files against directory moves |
755 for f in u1r + u2r: | 626 for f in u1 + u2: |
756 if f not in fullcopy: | 627 if f not in fullcopy: |
757 for d in dirmove: | 628 for d in dirmove: |
758 if f.startswith(d): | 629 if f.startswith(d): |
759 # new file added in a directory that was moved, move it | 630 # new file added in a directory that was moved, move it |
760 df = dirmove[d] + f[len(d):] | 631 df = dirmove[d] + f[len(d):] |
896 f2 = next(g2) | 767 f2 = next(g2) |
897 else: # f1 and f2 point to files in the same linkrev | 768 else: # f1 and f2 point to files in the same linkrev |
898 return f1 == f2 # true if they point to the same file | 769 return f1 == f2 # true if they point to the same file |
899 except StopIteration: | 770 except StopIteration: |
900 return False | 771 return False |
901 | |
902 def _checkcopies(srcctx, dstctx, f, base, tca, remotebase, limit, data): | |
903 """ | |
904 check possible copies of f from msrc to mdst | |
905 | |
906 srcctx = starting context for f in msrc | |
907 dstctx = destination context for f in mdst | |
908 f = the filename to check (as in msrc) | |
909 base = the changectx used as a merge base | |
910 tca = topological common ancestor for graft-like scenarios | |
911 remotebase = True if base is outside tca::srcctx, False otherwise | |
912 limit = the rev number to not search beyond | |
913 data = dictionary of dictionary to store copy data. (see mergecopies) | |
914 | |
915 note: limit is only an optimization, and provides no guarantee that | |
916 irrelevant revisions will not be visited | |
917 there is no easy way to make this algorithm stop in a guaranteed way | |
918 once it "goes behind a certain revision". | |
919 """ | |
920 | |
921 msrc = srcctx.manifest() | |
922 mdst = dstctx.manifest() | |
923 mb = base.manifest() | |
924 mta = tca.manifest() | |
925 # Might be true if this call is about finding backward renames, | |
926 # This happens in the case of grafts because the DAG is then rotated. | |
927 # If the file exists in both the base and the source, we are not looking | |
928 # for a rename on the source side, but on the part of the DAG that is | |
929 # traversed backwards. | |
930 # | |
931 # In the case there is both backward and forward renames (before and after | |
932 # the base) this is more complicated as we must detect a divergence. | |
933 # We use 'backwards = False' in that case. | |
934 backwards = not remotebase and base != tca and f in mb | |
935 getsrcfctx = _makegetfctx(srcctx) | |
936 getdstfctx = _makegetfctx(dstctx) | |
937 | |
938 if msrc[f] == mb.get(f) and not remotebase: | |
939 # Nothing to merge | |
940 return | |
941 | |
942 of = None | |
943 seen = {f} | |
944 for oc in getsrcfctx(f, msrc[f]).ancestors(): | |
945 of = oc.path() | |
946 if of in seen: | |
947 # check limit late - grab last rename before | |
948 if oc.linkrev() < limit: | |
949 break | |
950 continue | |
951 seen.add(of) | |
952 | |
953 # remember for dir rename detection | |
954 if backwards: | |
955 data['fullcopy'][of] = f # grafting backwards through renames | |
956 else: | |
957 data['fullcopy'][f] = of | |
958 if of not in mdst: | |
959 continue # no match, keep looking | |
960 if mdst[of] == mb.get(of): | |
961 return # no merge needed, quit early | |
962 c2 = getdstfctx(of, mdst[of]) | |
963 # c2 might be a plain new file on added on destination side that is | |
964 # unrelated to the droids we are looking for. | |
965 cr = _related(oc, c2) | |
966 if cr and (of == f or of == c2.path()): # non-divergent | |
967 if backwards: | |
968 data['copy'][of] = f | |
969 elif of in mb: | |
970 data['copy'][f] = of | |
971 elif remotebase: # special case: a <- b <- a -> b "ping-pong" rename | |
972 data['copy'][of] = f | |
973 del data['fullcopy'][f] | |
974 data['fullcopy'][of] = f | |
975 else: # divergence w.r.t. graft CA on one side of topological CA | |
976 for sf in seen: | |
977 if sf in mb: | |
978 assert sf not in data['diverge'] | |
979 data['diverge'][sf] = [f, of] | |
980 break | |
981 return | |
982 | |
983 if of in mta: | |
984 if backwards or remotebase: | |
985 data['incomplete'][of] = f | |
986 else: | |
987 for sf in seen: | |
988 if sf in mb: | |
989 if tca == base: | |
990 data['diverge'].setdefault(sf, []).append(f) | |
991 else: | |
992 data['incompletediverge'][sf] = [of, f] | |
993 return | |
994 | 772 |
995 def duplicatecopies(repo, wctx, rev, fromrev, skiprev=None): | 773 def duplicatecopies(repo, wctx, rev, fromrev, skiprev=None): |
996 """reproduce copies from fromrev to rev in the dirstate | 774 """reproduce copies from fromrev to rev in the dirstate |
997 | 775 |
998 If skiprev is specified, it's a revision that should be used to | 776 If skiprev is specified, it's a revision that should be used to |