fix: reduce number of tool executions
author Danny Hooper <hooper@google.com>
date Thu, 02 Sep 2021 14:08:45 -0700
changeset 48190 f12a19d03d2c
parent 48189 066cdec8f74f
child 48191 67d14d4e036c
fix: reduce number of tool executions

By grouping together (path, ctx) pairs according to the inputs they would provide to fixer tools, we can deduplicate executions of fixer tools to significantly reduce the amount of time spent running slow tools.

This change does not handle clean files in the working copy, which could still be deduplicated against the files in the checked out commit. It's a little harder to do that because the filerev is not available in the workingfilectx (and it doesn't exist for added files).

Anecdotally, this change makes some real use cases at Google 10x faster. I think we were originally hesitant to do this because the benefits weren't obvious, and implementing it efficiently is kind of tricky. If we simply memoized the formatter execution function, we would be keeping tons of file content in memory.

Also included is a regression test for a corner case that I broke with my first attempt at optimizing this code.

Differential Revision: https://phab.mercurial-scm.org/D11280
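
For illustration, the following is a minimal standalone sketch of the grouping idea described above, not the extension's actual API: plain integers stand in for revisions and file revisions, byte strings stand in for paths, and the name build_workqueue is invented for the sketch. The real implementation is the getworkqueue() change in hgext/fix.py below.

    import collections

    def build_workqueue(items):
        """Group (rev, path) pairs that would feed identical input to fixer tools.

        `items` is an iterable of (rev, path, filerev, baserevs) tuples. Items
        sharing the same (filerev, baserevs, path) key would produce identical
        fixer input, so the tool runs once and the fixed data is written back
        to every revision in dstrevs.
        """
        dstrevmap = collections.defaultdict(list)
        for rev, path, filerev, baserevs in items:
            dstrevmap[(filerev, tuple(baserevs), path)].append(rev)
        workqueue = [
            # srcrev is min(dstrevs), so the plain tuple sort below also orders
            # work items by ascending revision, matching topological commit order.
            (min(dstrevs), path, tuple(sorted(dstrevs)))
            for (filerev, baserevs, path), dstrevs in dstrevmap.items()
        ]
        workqueue.sort()
        return workqueue

    # Revisions 1, 2 and 3 carry the same file revision of foo.txt with the same
    # baserevs, so the fixer tool runs once and its output fans out to all three.
    print(build_workqueue([
        (1, b'foo.txt', 7, [0]),
        (2, b'foo.txt', 7, [0]),
        (3, b'foo.txt', 7, [0]),
        (3, b'bar.txt', 9, [0]),
    ]))
    # [(1, b'foo.txt', (1, 2, 3)), (3, b'bar.txt', (3,))]
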
hgext/fix.py
tests/test-fix.t
--- a/hgext/fix.py	Thu Sep 02 14:07:55 2021 -0700
+++ b/hgext/fix.py	Thu Sep 02 14:08:45 2021 -0700
@@ -284,20 +284,29 @@
         # There are no data dependencies between the workers fixing each file
         # revision, so we can use all available parallelism.
         def getfixes(items):
-            for rev, path in items:
-                ctx = repo[rev]
+            for srcrev, path, dstrevs in items:
+                ctx = repo[srcrev]
                 olddata = ctx[path].data()
                 metadata, newdata = fixfile(
-                    ui, repo, opts, fixers, ctx, path, basepaths, basectxs[rev]
+                    ui,
+                    repo,
+                    opts,
+                    fixers,
+                    ctx,
+                    path,
+                    basepaths,
+                    basectxs[srcrev],
                 )
-                # Don't waste memory/time passing unchanged content back, but
-                # produce one result per item either way.
-                yield (
-                    rev,
-                    path,
-                    metadata,
-                    newdata if newdata != olddata else None,
-                )
+                # We ungroup the work items now, because the code that consumes
+                # these results has to handle each dstrev separately, and in
+                # topological order. Because these are handled in topological
+                # order, it's important that we pass around references to
+                # "newdata" instead of copying it. Otherwise, we would be
+                # keeping more copies of file content in memory at a time than
+                # if we hadn't bothered to group/deduplicate the work items.
+                data = newdata if newdata != olddata else None
+                for dstrev in dstrevs:
+                    yield (dstrev, path, metadata, data)
 
         results = worker.worker(
             ui, 1.0, getfixes, tuple(), workqueue, threadsafe=False
@@ -377,23 +386,32 @@
 
 
 def getworkqueue(ui, repo, pats, opts, revstofix, basectxs):
-    """Constructs the list of files to be fixed at specific revisions
+    """Constructs a list of files to fix and which revisions each fix applies to
 
-    It is up to the caller how to consume the work items, and the only
-    dependence between them is that replacement revisions must be committed in
-    topological order. Each work item represents a file in the working copy or
-    in some revision that should be fixed and written back to the working copy
-    or into a replacement revision.
+    To avoid duplicating work, there is usually only one work item for each file
+    revision that might need to be fixed. There can be multiple work items per
+    file revision if the same file needs to be fixed in multiple changesets with
+    different baserevs. Each work item also contains a list of changesets where
+    the file's data should be replaced with the fixed data. The work items for
+    earlier changesets come earlier in the work queue, to improve pipelining by
+    allowing the first changeset to be replaced while fixes are still being
+    computed for later changesets.
 
-    Work items for the same revision are grouped together, so that a worker
-    pool starting with the first N items in parallel is likely to finish the
-    first revision's work before other revisions. This can allow us to write
-    the result to disk and reduce memory footprint. At time of writing, the
-    partition strategy in worker.py seems favorable to this. We also sort the
-    items by ascending revision number to match the order in which we commit
-    the fixes later.
+    Also returned is a map from changesets to the count of work items that might
+    affect each changeset. This is used later to count when all of a changeset's
+    work items have been finished, without having to inspect the remaining work
+    queue in each worker subprocess.
+
+    The example work item (1, "foo/bar.txt", (1, 2, 3)) means that the data of
+    bar.txt should be read from revision 1, then fixed, and written back to
+    revisions 1, 2 and 3. Revision 1 is called the "srcrev" and the list of
+    revisions is called the "dstrevs". In practice the srcrev is always one of
+    the dstrevs, and we make that choice when constructing the work item so that
+    the choice can't be made inconsistently later on. The dstrevs should all
+    have the same file revision for the given path, so the choice of srcrev is
+    arbitrary. The wdirrev can be a dstrev and a srcrev.
     """
-    workqueue = []
+    dstrevmap = collections.defaultdict(list)
     numitems = collections.defaultdict(int)
     maxfilesize = ui.configbytes(b'fix', b'maxfilesize')
     for rev in sorted(revstofix):
@@ -411,8 +429,21 @@
                     % (util.bytecount(maxfilesize), path)
                 )
                 continue
-            workqueue.append((rev, path))
+            baserevs = tuple(ctx.rev() for ctx in basectxs[rev])
+            dstrevmap[(fctx.filerev(), baserevs, path)].append(rev)
             numitems[rev] += 1
+    workqueue = [
+        (min(dstrevs), path, dstrevs)
+        for (filerev, baserevs, path), dstrevs in dstrevmap.items()
+    ]
+    # Move work items for earlier changesets to the front of the queue, so we
+    # might be able to replace those changesets (in topological order) while
+    # we're still processing later work items. Note the min() in the previous
+    # expression, which means we don't need a custom comparator here. The path
+    # is also important in the sort order to make the output order stable. There
+    # are some situations where this doesn't help much, but some situations
+    # where it lets us buffer O(1) files instead of O(n) files.
+    workqueue.sort()
     return workqueue, numitems
 
 
@@ -517,9 +548,9 @@
         return {}
 
     basepaths = {}
-    for rev, path in workqueue:
-        fixctx = repo[rev]
-        for basectx in basectxs[rev]:
+    for srcrev, path, _dstrevs in workqueue:
+        fixctx = repo[srcrev]
+        for basectx in basectxs[srcrev]:
             basepath = copies.pathcopies(basectx, fixctx).get(path, path)
             if basepath in basectx:
                 basepaths[(basectx.rev(), fixctx.rev(), path)] = basepath
@@ -642,10 +673,10 @@
     toprefetch = set()
 
     # Prefetch the files that will be fixed.
-    for rev, path in workqueue:
-        if rev == wdirrev:
+    for srcrev, path, _dstrevs in workqueue:
+        if srcrev == wdirrev:
             continue
-        toprefetch.add((rev, path))
+        toprefetch.add((srcrev, path))
 
     # Prefetch the base contents for lineranges().
     for (baserev, fixrev, path), basepath in basepaths.items():
--- a/tests/test-fix.t	Thu Sep 02 14:07:55 2021 -0700
+++ b/tests/test-fix.t	Thu Sep 02 14:08:45 2021 -0700
@@ -1797,7 +1797,56 @@
   $ cat $LOGFILE | sort | uniq -c
         4 bar.log
         4 baz.log
-        4 foo.log
-        4 qux.log
+        3 foo.log
+        2 qux.log
 
   $ cd ..
+
+For tools that support line ranges, it's wrong to blindly re-use fixed file
+content for the same file revision if it appears twice with different baserevs,
+because the line ranges could be different. Since computing line ranges is
+ambiguous, this isn't a matter of correctness, but it affects the usability of
+this extension. It could maybe be simpler if baserevs were computed on a
+per-file basis to make this situation impossible to construct.
+
+In the following example, we construct two subgraphs with the same file
+revisions, and fix different sub-subgraphs to get different baserevs and
+different changed line ranges. The key precondition is that revisions 1 and 4
+have the same file revision, and the key result is that their successors don't
+have the same file content, because we want to fix different areas of that same
+file revision's content.
+
+  $ hg init differentlineranges
+  $ cd differentlineranges
+
+  $ printf "a\nb\n" > file.changed
+  $ hg commit -Aqm "0 ab"
+  $ printf "a\nx\n" > file.changed
+  $ hg commit -Aqm "1 ax"
+  $ hg remove file.changed
+  $ hg commit -Aqm "2 removed"
+  $ hg revert file.changed -r 0
+  $ hg commit -Aqm "3 ab (reverted)"
+  $ hg revert file.changed -r 1
+  $ hg commit -Aqm "4 ax (reverted)"
+
+  $ hg manifest --debug --template "{hash}\n" -r 0; \
+  > hg manifest --debug --template "{hash}\n" -r 3
+  418f692145676128d2fb518b027ddbac624be76e
+  418f692145676128d2fb518b027ddbac624be76e
+  $ hg manifest --debug --template "{hash}\n" -r 1; \
+  > hg manifest --debug --template "{hash}\n" -r 4
+  09b8b3ce5a507caaa282f7262679e6d04091426c
+  09b8b3ce5a507caaa282f7262679e6d04091426c
+
+  $ hg fix --working-dir -r 1+3+4
+  3 new orphan changesets
+
+  $ hg cat file.changed -r "successors(1)" --hidden
+  a
+  X
+  $ hg cat file.changed -r "successors(4)" --hidden
+  A
+  X
+
+  $ cd ..