comparison mercurial/copies.py @ 42222:57203e0210f8

copies: calculate mergecopies() based on pathcopies()

When copies are stored in changesets, we need a changeset-centric
version of mergecopies() just like we have a changeset-centric version
of pathcopies(). I think the natural way of thinking about
mergecopies() is in terms of pathcopies() from the base to each of the
commits. So if we can rewrite mergecopies() based on two such
pathcopies() calls, we'll get the changeset-centric version for free.
That's what this patch does.

A nice bonus is that it ends up being a lot simpler. mergecopies() has
accumulated a lot of technical debt over time. One good example is the
code for dealing with grafts (the "partial/incomplete/dirty" stuff).
Since pathcopies() already deals with backwards renames and ping-pong
renames, we get that for free.

I've run tests with hard-coded debug logging for "fullcopy" and while
I haven't looked at every difference it produces, all the ones I have
looked at seemed reasonable to me. I'm a little surprised that no more
tests fail when run with
'--extra-config-opt experimental.copies.read-from=compatibility'
compared to before this patch.

This patch also fixes the broken cases in test-annotate.t and
test-fastannotate.t. It also enables the part of test-copies.t that was
previously disabled exactly because mergecopies() needed to get a
changeset-centric version.

One drawback of the rewritten code is that we may now make
remotefilelog prefetch more files. We used to prefetch files that were
unique to either side of the merge compared to the other. We now
prefetch files that are unique to either side of the merge compared to
the base. This means that if you added the same file to each side, we
would not prefetch it before, but we would now. Such cases are probably
quite rare, but one likely scenario where they happen is when moving
from a commit to its successor (or the other way around). The user will
probably already have the files in the cache in such cases, so it's
probably not a big deal.
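The core of the rewrite can be sketched as follows. This is an illustrative simplification of the new "renamed on both sides" classification, not the patch itself; the helper names (invert, classify) and the deleted_on_both parameter are hypothetical stand-ins for the inversecopies1/inversecopies2 logic in the diff below:

```python
def invert(copies):
    """Turn a {dst: src} copy mapping into {src: [dst, ...]}."""
    inverse = {}
    for dst, src in copies.items():
        inverse.setdefault(src, []).append(dst)
    return inverse

def classify(copies1, copies2, deleted_on_both):
    """Classify sources renamed on both sides as merged copies or divergences.

    copies1/copies2 are the {dst: src} results of pathcopies(base, c1) and
    pathcopies(base, c2). deleted_on_both holds sources present in neither
    side's manifest, i.e. renamed (not copied) on both sides.
    """
    inv1, inv2 = invert(copies1), invert(copies2)
    copy, diverge = {}, {}
    for src in set(inv1) | set(inv2):
        dsts1 = set(inv1.get(src, []))
        dsts2 = set(inv2.get(src, []))
        if dsts1 and dsts2 and src in deleted_on_both:
            if dsts1 & dsts2:
                # overlapping destinations: treat the common ones as copies
                for dst in dsts1 & dsts2:
                    copy[dst] = src
            else:
                # disjoint destinations: a genuine divergence
                diverge[src] = sorted(dsts1 | dsts2)
    return copy, diverge
```

For example, if side 1 renames 'a' to 'b' while side 2 renames 'a' to 'c', the source 'a' is reported as divergent; if both sides rename 'a' to 'c', 'c' is reported as a plain copy.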
Some timings for calculating mergecopies between two revisions
(revisions shown on each line, all using the common ancestor as base):

In the hg repo:
  4.8 4.9: 0.21s -> 0.21s
  4.0 4.8: 0.35s -> 0.63s

In an old copy of the mozilla-unified repo:
  FIREFOX_BETA_60_BASE^ FIREFOX_BETA_60_BASE: 0.82s -> 0.82s
  FIREFOX_NIGHTLY_59_END FIREFOX_BETA_60_BASE: 2.5s -> 2.6s
  FIREFOX_BETA_59_END FIREFOX_BETA_60_BASE: 3.9s -> 4.1s
  FIREFOX_AURORA_50_BASE FIREFOX_BETA_60_BASE: 31s -> 33s

So it's measurably slower in most cases. The most significant
difference is in the hg repo between revisions 4.0 and 4.8. In that
case it seems to come from the fact that pathcopies() uses
fctx.isintroducedafter() (in _tracefile), while the old mergecopies()
used fctx.linkrev() (in _checkcopies()). That results in a single call
to filectx._adjustlinkrev(), which is responsible for the entire
difference in time (in my repo). So we pay a performance penalty but we
get more correct code (see change in test-mv-cp-st-diff.t). Deleting
the "== f.filenode()" in _tracefile() recovers the lost performance in
the hg repo.

There were a few other optimizations in _checkcopies() that I could not
measure any impact from. One was from the "seen" set. Another was from
a "continue" when the file was not in the destination manifest
(corresponding to "am" in _tracefile).

Also note that merge copies are not calculated when updating with a
clean working copy, which is probably the most common case. I therefore
think the much simpler code is worth the slowdown.

Differential Revision: https://phab.mercurial-scm.org/D6255
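Timings like the ones above can be gathered with a small best-of-N harness along these lines. The stand-in function here is hypothetical; the real measurements exercised mergecopies() against actual repositories:

```python
import timeit

def mergecopies_stub():
    # stand-in for copies.mergecopies(repo, c1, c2, base); the real
    # benchmark ran against the hg and mozilla-unified repositories
    return sum(range(1000))

# take the best of three repeats to reduce scheduler noise, as is
# conventional when benchmarking with timeit
elapsed = min(timeit.repeat(mergecopies_stub, number=100, repeat=3))
print('%.4fs per 100 calls' % elapsed)
```

Taking the minimum over repeats gives a lower bound on the cost of one call, which is usually the most stable number to compare across revisions.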
author Martin von Zweigbergk <martinvonz@google.com>
date Thu, 11 Apr 2019 23:22:54 -0700
parents 390ec72b8ea4
children d69bc8ffbe6f
42221:c83c08cf02b7 42222:57203e0210f8
371 if u2: 371 if u2:
372 repo.ui.debug("%s:\n %s\n" % (header % 'other', "\n ".join(u2))) 372 repo.ui.debug("%s:\n %s\n" % (header % 'other', "\n ".join(u2)))
373 373
374 return u1, u2 374 return u1, u2
375 375
376 def _makegetfctx(ctx):
377 """return a 'getfctx' function suitable for _checkcopies usage
378
379 We have to re-setup the function building 'filectx' for each
380 '_checkcopies' to ensure the linkrev adjustment is properly set up for
381 each. Linkrev adjustment is important to avoid bugs in rename
382 detection. Moreover, having a proper '_ancestrycontext' setup ensures
383 the performance impact of this adjustment is kept limited. Without it,
384 each file could do a full dag traversal making the time complexity of
385 the operation explode (see issue4537).
386
387 This function exists here mostly to limit the impact on stable. Feel
388 free to refactor on default.
389 """
390 rev = ctx.rev()
391 repo = ctx._repo
392 ac = getattr(ctx, '_ancestrycontext', None)
393 if ac is None:
394 revs = [rev]
395 if rev is None:
396 revs = [p.rev() for p in ctx.parents()]
397 ac = repo.changelog.ancestors(revs, inclusive=True)
398 ctx._ancestrycontext = ac
399 def makectx(f, n):
400 if n in node.wdirfilenodeids: # in a working context?
401 if ctx.rev() is None:
402 return ctx.filectx(f)
403 return repo[None][f]
404 fctx = repo.filectx(f, fileid=n)
405 # setup only needed for filectx not created from a changectx
406 fctx._ancestrycontext = ac
407 fctx._descendantrev = rev
408 return fctx
409 return util.lrucachefunc(makectx)
410
411 def _combinecopies(copyfrom, copyto, finalcopy, diverge, incompletediverge):
412 """combine partial copy paths"""
413 remainder = {}
414 for f in copyfrom:
415 if f in copyto:
416 finalcopy[copyto[f]] = copyfrom[f]
417 del copyto[f]
418 for f in incompletediverge:
419 assert f not in diverge
420 ic = incompletediverge[f]
421 if ic[0] in copyto:
422 diverge[f] = [copyto[ic[0]], ic[1]]
423 else:
424 remainder[f] = ic
425 return remainder
426
427 def mergecopies(repo, c1, c2, base): 376 def mergecopies(repo, c1, c2, base):
428 """ 377 """
429 Finds moves and copies between context c1 and c2 that are relevant for 378 Finds moves and copies between context c1 and c2 that are relevant for
430 merging. 'base' will be used as the merge base. 379 merging. 'base' will be used as the merge base.
431 380
516 'copytrace.sourcecommitlimit') 465 'copytrace.sourcecommitlimit')
517 commits = len(repo.revs('%d::%d', base.rev(), c1.rev())) 466 commits = len(repo.revs('%d::%d', base.rev(), c1.rev()))
518 return commits < sourcecommitlimit 467 return commits < sourcecommitlimit
519 return False 468 return False
520 469
470 def _checksinglesidecopies(src, dsts1, m1, m2, mb, c2, base,
471 copy, renamedelete):
472 if src not in m2:
473 # deleted on side 2
474 if src not in m1:
475 # renamed on side 1, deleted on side 2
476 renamedelete[src] = dsts1
477 elif m2[src] != mb[src]:
478 if not _related(c2[src], base[src]):
479 return
480 # modified on side 2
481 for dst in dsts1:
482 if dst not in m2:
483 # dst not added on side 2 (handle as regular
484 # "both created" case in manifestmerge otherwise)
485 copy[dst] = src
486
521 def _fullcopytracing(repo, c1, c2, base): 487 def _fullcopytracing(repo, c1, c2, base):
522 """ The full copytracing algorithm which finds all the new files that were 488 """ The full copytracing algorithm which finds all the new files that were
523 added from merge base up to the top commit and for each file it checks if 489 added from merge base up to the top commit and for each file it checks if
524 this file was copied from another file. 490 this file was copied from another file.
525 491
526 This is pretty slow when a lot of changesets are involved but will track all 492 This is pretty slow when a lot of changesets are involved but will track all
527 the copies. 493 the copies.
528 """ 494 """
529 # In certain scenarios (e.g. graft, update or rebase), base can be
530 # overridden. We still need to know a real common ancestor in this case. We
531 # can't just compute _c1.ancestor(_c2) and compare it to ca, because there
532 # can be multiple common ancestors, e.g. in case of bidmerge. Because our
533 # caller may not know if the revision passed in lieu of the CA is a genuine
534 # common ancestor or not without explicitly checking it, it's better to
535 # determine that here.
536 #
537 # base.isancestorof(wc) is False, work around that
538 _c1 = c1.p1() if c1.rev() is None else c1
539 _c2 = c2.p1() if c2.rev() is None else c2
540 # an endpoint is "dirty" if it isn't a descendant of the merge base
541 # if we have a dirty endpoint, we need to trigger graft logic, and also
542 # keep track of which endpoint is dirty
543 dirtyc1 = not base.isancestorof(_c1)
544 dirtyc2 = not base.isancestorof(_c2)
545 graft = dirtyc1 or dirtyc2
546 tca = base
547 if graft:
548 tca = _c1.ancestor(_c2)
549
550 limit = _findlimit(repo, c1, c2)
551
552 m1 = c1.manifest() 495 m1 = c1.manifest()
553 m2 = c2.manifest() 496 m2 = c2.manifest()
554 mb = base.manifest() 497 mb = base.manifest()
555 498
556 # gather data from _checkcopies: 499 copies1 = pathcopies(base, c1)
557 # - diverge = record all diverges in this dict 500 copies2 = pathcopies(base, c2)
558 # - copy = record all non-divergent copies in this dict 501
559 # - fullcopy = record all copies in this dict 502 inversecopies1 = {}
560 # - incomplete = record non-divergent partial copies here 503 inversecopies2 = {}
561 # - incompletediverge = record divergent partial copies here 504 for dst, src in copies1.items():
562 diverge = {} # divergence data is shared 505 inversecopies1.setdefault(src, []).append(dst)
563 incompletediverge = {} 506 for dst, src in copies2.items():
564 data1 = {'copy': {}, 507 inversecopies2.setdefault(src, []).append(dst)
565 'fullcopy': {}, 508
566 'incomplete': {}, 509 copy = {}
567 'diverge': diverge, 510 diverge = {}
568 'incompletediverge': incompletediverge, 511 renamedelete = {}
569 } 512 allsources = set(inversecopies1) | set(inversecopies2)
570 data2 = {'copy': {}, 513 for src in allsources:
571 'fullcopy': {}, 514 dsts1 = inversecopies1.get(src)
572 'incomplete': {}, 515 dsts2 = inversecopies2.get(src)
573 'diverge': diverge, 516 if dsts1 and dsts2:
574 'incompletediverge': incompletediverge, 517 # copied/renamed on both sides
575 } 518 if src not in m1 and src not in m2:
519 # renamed on both sides
520 dsts1 = set(dsts1)
521 dsts2 = set(dsts2)
522 # If there's some overlap in the rename destinations, we
523 # consider it not divergent. For example, if side 1 copies 'a'
524 # to 'b' and 'c' and deletes 'a', and side 2 copies 'a' to 'c'
525 # and 'd' and deletes 'a'.
526 if dsts1 & dsts2:
527 for dst in (dsts1 & dsts2):
528 copy[dst] = src
529 else:
530 diverge[src] = sorted(dsts1 | dsts2)
531 elif src in m1 and src in m2:
532 # copied on both sides
533 dsts1 = set(dsts1)
534 dsts2 = set(dsts2)
535 for dst in (dsts1 & dsts2):
536 copy[dst] = src
537 # TODO: Handle cases where it was renamed on one side and copied
538 # on the other side
539 elif dsts1:
540 # copied/renamed only on side 1
541 _checksinglesidecopies(src, dsts1, m1, m2, mb, c2, base,
542 copy, renamedelete)
543 elif dsts2:
544 # copied/renamed only on side 2
545 _checksinglesidecopies(src, dsts2, m2, m1, mb, c1, base,
546 copy, renamedelete)
547
548 renamedeleteset = set()
549 divergeset = set()
550 for src, dsts in diverge.items():
551 divergeset.update(dsts)
552 for src, dsts in renamedelete.items():
553 renamedeleteset.update(dsts)
576 554
577 # find interesting file sets from manifests 555 # find interesting file sets from manifests
578 addedinm1 = m1.filesnotin(mb, repo.narrowmatch()) 556 addedinm1 = m1.filesnotin(mb, repo.narrowmatch())
579 addedinm2 = m2.filesnotin(mb, repo.narrowmatch()) 557 addedinm2 = m2.filesnotin(mb, repo.narrowmatch())
580 bothnew = sorted(addedinm1 & addedinm2) 558 u1, u2 = _computenonoverlap(repo, c1, c2, addedinm1, addedinm2)
581 if tca == base: 559
582 # unmatched file from base 560 fullcopy = copies1.copy()
583 u1r, u2r = _computenonoverlap(repo, c1, c2, addedinm1, addedinm2) 561 fullcopy.update(copies2)
584 u1u, u2u = u1r, u2r
585 else:
586 # unmatched file from base (DAG rotation in the graft case)
587 u1r, u2r = _computenonoverlap(repo, c1, c2, addedinm1, addedinm2)
588 # unmatched file from topological common ancestors (no DAG rotation)
589 # need to recompute this for directory move handling when grafting
590 mta = tca.manifest()
591 u1u, u2u = _computenonoverlap(repo, c1, c2,
592 m1.filesnotin(mta, repo.narrowmatch()),
593 m2.filesnotin(mta, repo.narrowmatch()),
594 debug=False)
595
596 for f in u1u:
597 _checkcopies(c1, c2, f, base, tca, dirtyc1, limit, data1)
598
599 for f in u2u:
600 _checkcopies(c2, c1, f, base, tca, dirtyc2, limit, data2)
601
602 copy = dict(data1['copy'])
603 copy.update(data2['copy'])
604 fullcopy = dict(data1['fullcopy'])
605 fullcopy.update(data2['fullcopy'])
606
607 if dirtyc1:
608 _combinecopies(data2['incomplete'], data1['incomplete'], copy, diverge,
609 incompletediverge)
610 if dirtyc2:
611 _combinecopies(data1['incomplete'], data2['incomplete'], copy, diverge,
612 incompletediverge)
613
614 renamedelete = {}
615 renamedeleteset = set()
616 divergeset = set()
617 for of, fl in list(diverge.items()):
618 if len(fl) == 1 or of in c1 or of in c2:
619 del diverge[of] # not actually divergent, or not a rename
620 if of not in c1 and of not in c2:
621 # renamed on one side, deleted on the other side, but filter
622 # out files that have been renamed and then deleted
623 renamedelete[of] = [f for f in fl if f in c1 or f in c2]
624 renamedeleteset.update(fl) # reverse map for below
625 else:
626 divergeset.update(fl) # reverse map for below
627
628 bothdiverge = {}
629 bothincompletediverge = {}
630 remainder = {}
631 both1 = {'copy': {},
632 'fullcopy': {},
633 'incomplete': {},
634 'diverge': bothdiverge,
635 'incompletediverge': bothincompletediverge
636 }
637 both2 = {'copy': {},
638 'fullcopy': {},
639 'incomplete': {},
640 'diverge': bothdiverge,
641 'incompletediverge': bothincompletediverge
642 }
643 for f in bothnew:
644 _checkcopies(c1, c2, f, base, tca, dirtyc1, limit, both1)
645 _checkcopies(c2, c1, f, base, tca, dirtyc2, limit, both2)
646 if dirtyc1 and dirtyc2:
647 remainder = _combinecopies(both2['incomplete'], both1['incomplete'],
648 copy, bothdiverge, bothincompletediverge)
649 remainder1 = _combinecopies(both1['incomplete'], both2['incomplete'],
650 copy, bothdiverge, bothincompletediverge)
651 remainder.update(remainder1)
652 elif dirtyc1:
653 # incomplete copies may only be found on the "dirty" side for bothnew
654 assert not both2['incomplete']
655 remainder = _combinecopies({}, both1['incomplete'], copy, bothdiverge,
656 bothincompletediverge)
657 elif dirtyc2:
658 assert not both1['incomplete']
659 remainder = _combinecopies({}, both2['incomplete'], copy, bothdiverge,
660 bothincompletediverge)
661 else:
662 # incomplete copies and divergences can't happen outside grafts
663 assert not both1['incomplete']
664 assert not both2['incomplete']
665 assert not bothincompletediverge
666 for f in remainder:
667 assert f not in bothdiverge
668 ic = remainder[f]
669 if ic[0] in (m1 if dirtyc1 else m2):
670 # backed-out rename on one side, but watch out for deleted files
671 bothdiverge[f] = ic
672 for of, fl in bothdiverge.items():
673 if len(fl) == 2 and fl[0] == fl[1]:
674 copy[fl[0]] = of # not actually divergent, just matching renames
675
676 # Sometimes we get invalid copies here (the "and not remotebase" in
677 # _checkcopies() seems suspicious). Filter them out.
678 for dst, src in fullcopy.copy().items():
679 if src not in mb:
680 del fullcopy[dst]
681 # Sometimes we forget to add entries from "copy" to "fullcopy", so fix
682 # that up here
683 for dst, src in copy.items():
684 fullcopy[dst] = src
685 # Sometimes we forget to add entries from "diverge" to "fullcopy", so fix
686 # that up here
687 for src, dsts in diverge.items():
688 for dst in dsts:
689 fullcopy[dst] = src
690
691 if not fullcopy: 562 if not fullcopy:
692 return copy, {}, diverge, renamedelete, {} 563 return copy, {}, diverge, renamedelete, {}
693 564
694 if repo.ui.debugflag: 565 if repo.ui.debugflag:
695 repo.ui.debug(" all copies found (* = to merge, ! = divergent, " 566 repo.ui.debug(" all copies found (* = to merge, ! = divergent, "
750 repo.ui.debug(" discovered dir src: '%s' -> dst: '%s'\n" % 621 repo.ui.debug(" discovered dir src: '%s' -> dst: '%s'\n" %
751 (d, dirmove[d])) 622 (d, dirmove[d]))
752 623
753 movewithdir = {} 624 movewithdir = {}
754 # check unaccounted nonoverlapping files against directory moves 625 # check unaccounted nonoverlapping files against directory moves
755 for f in u1r + u2r: 626 for f in u1 + u2:
756 if f not in fullcopy: 627 if f not in fullcopy:
757 for d in dirmove: 628 for d in dirmove:
758 if f.startswith(d): 629 if f.startswith(d):
759 # new file added in a directory that was moved, move it 630 # new file added in a directory that was moved, move it
760 df = dirmove[d] + f[len(d):] 631 df = dirmove[d] + f[len(d):]
896 f2 = next(g2) 767 f2 = next(g2)
897 else: # f1 and f2 point to files in the same linkrev 768 else: # f1 and f2 point to files in the same linkrev
898 return f1 == f2 # true if they point to the same file 769 return f1 == f2 # true if they point to the same file
899 except StopIteration: 770 except StopIteration:
900 return False 771 return False
901
902 def _checkcopies(srcctx, dstctx, f, base, tca, remotebase, limit, data):
903 """
904 check possible copies of f from msrc to mdst
905
906 srcctx = starting context for f in msrc
907 dstctx = destination context for f in mdst
908 f = the filename to check (as in msrc)
909 base = the changectx used as a merge base
910 tca = topological common ancestor for graft-like scenarios
911 remotebase = True if base is outside tca::srcctx, False otherwise
912 limit = the rev number to not search beyond
913 data = dictionary of dictionary to store copy data. (see mergecopies)
914
915 note: limit is only an optimization, and provides no guarantee that
916 irrelevant revisions will not be visited.
917 There is no easy way to make this algorithm stop in a guaranteed way
918 once it "goes behind a certain revision".
919 """
920
921 msrc = srcctx.manifest()
922 mdst = dstctx.manifest()
923 mb = base.manifest()
924 mta = tca.manifest()
925 # Might be true if this call is about finding backward renames.
926 # This happens in the case of grafts because the DAG is then rotated.
927 # If the file exists in both the base and the source, we are not looking
928 # for a rename on the source side, but on the part of the DAG that is
929 # traversed backwards.
930 #
931 # In the case there are both backward and forward renames (before and after
932 # the base) this is more complicated as we must detect a divergence.
933 # We use 'backwards = False' in that case.
934 backwards = not remotebase and base != tca and f in mb
935 getsrcfctx = _makegetfctx(srcctx)
936 getdstfctx = _makegetfctx(dstctx)
937
938 if msrc[f] == mb.get(f) and not remotebase:
939 # Nothing to merge
940 return
941
942 of = None
943 seen = {f}
944 for oc in getsrcfctx(f, msrc[f]).ancestors():
945 of = oc.path()
946 if of in seen:
947 # check limit late - grab last rename before
948 if oc.linkrev() < limit:
949 break
950 continue
951 seen.add(of)
952
953 # remember for dir rename detection
954 if backwards:
955 data['fullcopy'][of] = f # grafting backwards through renames
956 else:
957 data['fullcopy'][f] = of
958 if of not in mdst:
959 continue # no match, keep looking
960 if mdst[of] == mb.get(of):
961 return # no merge needed, quit early
962 c2 = getdstfctx(of, mdst[of])
963 # c2 might be a plain new file added on the destination side that is
964 # unrelated to the droids we are looking for.
965 cr = _related(oc, c2)
966 if cr and (of == f or of == c2.path()): # non-divergent
967 if backwards:
968 data['copy'][of] = f
969 elif of in mb:
970 data['copy'][f] = of
971 elif remotebase: # special case: a <- b <- a -> b "ping-pong" rename
972 data['copy'][of] = f
973 del data['fullcopy'][f]
974 data['fullcopy'][of] = f
975 else: # divergence w.r.t. graft CA on one side of topological CA
976 for sf in seen:
977 if sf in mb:
978 assert sf not in data['diverge']
979 data['diverge'][sf] = [f, of]
980 break
981 return
982
983 if of in mta:
984 if backwards or remotebase:
985 data['incomplete'][of] = f
986 else:
987 for sf in seen:
988 if sf in mb:
989 if tca == base:
990 data['diverge'].setdefault(sf, []).append(f)
991 else:
992 data['incompletediverge'][sf] = [of, f]
993 return
994 772
995 def duplicatecopies(repo, wctx, rev, fromrev, skiprev=None): 773 def duplicatecopies(repo, wctx, rev, fromrev, skiprev=None):
996 """reproduce copies from fromrev to rev in the dirstate 774 """reproduce copies from fromrev to rev in the dirstate
997 775
998 If skiprev is specified, it's a revision that should be used to 776 If skiprev is specified, it's a revision that should be used to