worker: document poor partitioning scheme impact
author Gregory Szorc <gregory.szorc@gmail.com>
Sat, 27 Feb 2016 21:43:17 -0800
changeset 28292 3eb7faf6d958
parent 28286 c7f89ad87bae
child 28293 a22b6fa5a844
worker: document poor partitioning scheme impact

mpm isn't a fan of the existing or previous partitioning scheme. He provided a fantastic justification for why on the mailing list. This patch adds his words to the code so they aren't forgotten.
mercurial/worker.py
--- a/mercurial/worker.py	Mon Feb 29 17:52:17 2016 -0600
+++ b/mercurial/worker.py	Sat Feb 27 21:43:17 2016 -0800
@@ -157,6 +157,28 @@
     The current strategy takes every Nth element from the input. If
     we ever write workers that need to preserve grouping in input
     we should consider allowing callers to specify a partition strategy.
+
+    mpm is not a fan of this partitioning strategy when files are involved.
+    In his words:
+
+        Single-threaded Mercurial makes a point of creating and visiting
+        files in a fixed order (alphabetical). When creating files in order,
+        a typical filesystem is likely to allocate them on nearby regions on
+        disk. Thus, when revisiting in the same order, locality is maximized
+        and various forms of OS and disk-level caching and read-ahead get a
+        chance to work.
+
+        This effect can be quite significant on spinning disks. I discovered it
+        circa Mercurial v0.4 when revlogs were named by hashes of filenames.
+        Tarring a repo and copying it to another disk effectively randomized
+        the revlog ordering on disk by sorting the revlogs by hash and suddenly
+        performance of my kernel checkout benchmark dropped by ~10x because the
+        "working set" of sectors visited no longer fit in the drive's cache and
+        the workload switched from streaming to random I/O.
+
+        What we should really be doing is have workers read filenames from an
+        ordered queue. This preserves locality and also keeps any worker from
+        getting more than one file out of balance.
     '''
     for i in range(nslices):
         yield lst[i::nslices]
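
For reference, the stride-slicing strategy the docstring describes is the slicing shown at the end of the hunk. A minimal standalone illustration (not part of the patch) of how it splits an ordered input; note how each slice pulls items scattered across the alphabetical order, which is exactly the locality concern mpm raises:

    # Every-Nth-element partitioning, as in the helper above: slice i takes
    # items i, i + nslices, i + 2 * nslices, ...
    def partition(lst, nslices):
        for i in range(nslices):
            yield lst[i::nslices]

    files = ['a.txt', 'b.txt', 'c.txt', 'd.txt', 'e.txt']
    print(list(partition(files, 2)))
    # [['a.txt', 'c.txt', 'e.txt'], ['b.txt', 'd.txt']]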
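The alternative mpm describes, workers pulling filenames from a shared ordered queue, might look roughly like the sketch below. The names (process_in_order, handle_file, worker_count) and the thread-based pool are illustrative assumptions, not Mercurial's actual worker API:

    # Rough sketch of the suggested alternative: workers consume filenames
    # from a single queue kept in the original (alphabetical) order, so files
    # are still visited roughly in on-disk order and no worker drifts more
    # than one file out of balance.
    import queue
    import threading

    def process_in_order(filenames, handle_file, worker_count=4):
        q = queue.Queue()
        for name in sorted(filenames):  # preserve the alphabetical ordering
            q.put(name)

        def run():
            while True:
                try:
                    name = q.get_nowait()
                except queue.Empty:
                    return
                handle_file(name)

        threads = [threading.Thread(target=run) for _ in range(worker_count)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()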