Mon, 13 Nov 2017 21:54:46 -0800 bundle2: inline struct operations
Gregory Szorc <gregory.szorc@gmail.com> [Mon, 13 Nov 2017 21:54:46 -0800] rev 35119
bundle2: inline struct operations Before, we were calling struct.unpack() (via an alias) on every loop iteration. I'm not sure what Python does under the hood, but it would have to look at the struct format and determine what to do. This commit establishes a struct.Struct instance and reuses it for struct reading. We can see the impact from running `hg perfbundleread` on a Firefox bundle: ! read(8k) ! wall 0.679730 comb 0.680000 user 0.140000 sys 0.540000 (best of 15) ! read(16k) ! wall 0.577228 comb 0.570000 user 0.080000 sys 0.490000 (best of 17) ! read(32k) ! wall 0.516060 comb 0.520000 user 0.040000 sys 0.480000 (best of 20) ! read(128k) ! wall 0.496378 comb 0.490000 user 0.010000 sys 0.480000 (best of 20) ! bundle2 iterparts() ! wall 3.056811 comb 3.050000 user 2.340000 sys 0.710000 (best of 4) ! wall 2.992605 comb 2.990000 user 2.260000 sys 0.730000 (best of 4) ! bundle2 iterparts() seekable ! wall 4.007676 comb 4.000000 user 3.170000 sys 0.830000 (best of 3) ! wall 3.863810 comb 3.860000 user 3.000000 sys 0.860000 (best of 3) ! bundle2 part seek() ! wall 6.267110 comb 6.250000 user 3.480000 sys 2.770000 (best of 3) ! wall 6.213387 comb 6.200000 user 3.350000 sys 2.850000 (best of 3) ! bundle2 part read(8k) ! wall 3.404164 comb 3.400000 user 2.650000 sys 0.750000 (best of 3) ! wall 3.241099 comb 3.250000 user 2.560000 sys 0.690000 (best of 3) ! bundle2 part read(16k) ! wall 3.197972 comb 3.200000 user 2.490000 sys 0.710000 (best of 4) ! wall 3.003930 comb 3.000000 user 2.270000 sys 0.730000 (best of 4) ! bundle2 part read(32k) ! wall 3.060557 comb 3.060000 user 2.340000 sys 0.720000 (best of 4) ! wall 2.904695 comb 2.900000 user 2.160000 sys 0.740000 (best of 4) ! bundle2 part read(128k) ! wall 2.952209 comb 2.950000 user 2.230000 sys 0.720000 (best of 4) ! wall 2.776140 comb 2.780000 user 2.070000 sys 0.710000 (best of 4) Profiling now says most remaining time is spent in util.chunkbuffer. I already heavily optimized that data structure several releases ago. So we'll likely get little more performance out of bundle2 reading while still retaining util.chunkbuffer(). Differential Revision: https://phab.mercurial-scm.org/D1393
Mon, 13 Nov 2017 21:48:35 -0800 bundle2: inline changegroup.readexactly()
Gregory Szorc <gregory.szorc@gmail.com> [Mon, 13 Nov 2017 21:48:35 -0800] rev 35118
bundle2: inline changegroup.readexactly() Profiling reveals this loop is pretty tight. Literally any function call elimination can make a big difference. This commit inlines the relatively trivial changegroup.readexactly() method inside the loop. The results with `hg perfbundleread` on a bundle of the Firefox repo speak for themselves: ! read(8k) ! wall 0.679730 comb 0.680000 user 0.140000 sys 0.540000 (best of 15) ! read(16k) ! wall 0.577228 comb 0.570000 user 0.080000 sys 0.490000 (best of 17) ! read(32k) ! wall 0.516060 comb 0.520000 user 0.040000 sys 0.480000 (best of 20) ! read(128k) ! wall 0.496378 comb 0.490000 user 0.010000 sys 0.480000 (best of 20) ! bundle2 iterparts() ! wall 3.460903 comb 3.460000 user 2.760000 sys 0.700000 (best of 3) ! wall 3.056811 comb 3.050000 user 2.340000 sys 0.710000 (best of 4) ! bundle2 iterparts() seekable ! wall 4.312722 comb 4.310000 user 3.480000 sys 0.830000 (best of 3) ! wall 4.007676 comb 4.000000 user 3.170000 sys 0.830000 (best of 3) ! bundle2 part seek() ! wall 6.754764 comb 6.740000 user 3.970000 sys 2.770000 (best of 3) ! wall 6.267110 comb 6.250000 user 3.480000 sys 2.770000 (best of 3) ! bundle2 part read(8k) ! wall 3.668004 comb 3.660000 user 2.960000 sys 0.700000 (best of 3) ! wall 3.404164 comb 3.400000 user 2.650000 sys 0.750000 (best of 3) ! bundle2 part read(16k) ! wall 3.489196 comb 3.480000 user 2.750000 sys 0.730000 (best of 3) ! wall 3.197972 comb 3.200000 user 2.490000 sys 0.710000 (best of 4) ! bundle2 part read(32k) ! wall 3.388569 comb 3.380000 user 2.640000 sys 0.740000 (best of 3) ! wall 3.060557 comb 3.060000 user 2.340000 sys 0.720000 (best of 4) ! bundle2 part read(128k) ! wall 3.276415 comb 3.270000 user 2.560000 sys 0.710000 (best of 4) ! wall 2.952209 comb 2.950000 user 2.230000 sys 0.720000 (best of 4) Differential Revision: https://phab.mercurial-scm.org/D1392
Mon, 13 Nov 2017 22:05:54 -0800 bundle2: inline debug logging
Gregory Szorc <gregory.szorc@gmail.com> [Mon, 13 Nov 2017 22:05:54 -0800] rev 35117
bundle2: inline debug logging Profiling revealed that repeated calls to indebug() were consuming a fair amount of CPU during bundle2 reading, with most of the time spent in ui.configbool(). Inlining indebug() and avoiding extra attribute lookups speeds things up substantially. Using `hg perfbundleread` with a Firefox bundle: ! read(8k) ! wall 0.679730 comb 0.680000 user 0.140000 sys 0.540000 (best of 15) ! read(16k) ! wall 0.577228 comb 0.570000 user 0.080000 sys 0.490000 (best of 17) ! read(32k) ! wall 0.516060 comb 0.520000 user 0.040000 sys 0.480000 (best of 20) ! read(128k) ! wall 0.496378 comb 0.490000 user 0.010000 sys 0.480000 (best of 20) ! bundle2 iterparts() ! wall 6.983756 comb 6.980000 user 6.220000 sys 0.760000 (best of 3) ! wall 3.460903 comb 3.460000 user 2.760000 sys 0.700000 (best of 3) ! bundle2 iterparts() seekable ! wall 8.132131 comb 8.110000 user 7.160000 sys 0.950000 (best of 3) ! wall 4.312722 comb 4.310000 user 3.480000 sys 0.830000 (best of 3) ! bundle2 part seek() ! wall 10.860942 comb 10.840000 user 7.790000 sys 3.050000 (best of 3) ! wall 6.754764 comb 6.740000 user 3.970000 sys 2.770000 (best of 3) ! bundle2 part read(8k) ! wall 7.258035 comb 7.260000 user 6.470000 sys 0.790000 (best of 3) ! wall 3.668004 comb 3.660000 user 2.960000 sys 0.700000 (best of 3) ! bundle2 part read(16k) ! wall 7.099891 comb 7.080000 user 6.310000 sys 0.770000 (best of 3) ! wall 3.489196 comb 3.480000 user 2.750000 sys 0.730000 (best of 3) ! bundle2 part read(32k) ! wall 6.964685 comb 6.950000 user 6.130000 sys 0.820000 (best of 3) ! wall 3.388569 comb 3.380000 user 2.640000 sys 0.740000 (best of 3) ! bundle2 part read(128k) ! wall 6.852867 comb 6.850000 user 6.060000 sys 0.790000 (best of 3) ! wall 3.276415 comb 3.270000 user 2.560000 sys 0.710000 (best of 4) Differential Revision: https://phab.mercurial-scm.org/D1391
Mon, 13 Nov 2017 21:10:37 -0800 bundle2: don't use seekable bundle2 parts by default (issue5691)
Gregory Szorc <gregory.szorc@gmail.com> [Mon, 13 Nov 2017 21:10:37 -0800] rev 35116
bundle2: don't use seekable bundle2 parts by default (issue5691) The last commit removed the last use of the bundle2 part seek() API in the generic bundle2 part iteration code. This means we can now switch to using unseekable bundle2 parts by default and have the special consumers that actually need the behavior request it. This commit changes unbundle20.iterparts() to expose non-seekable unbundlepart instances by default. If seekable parts are needed, callers can pass "seekable=True." The bundlerepo class needs seekable parts, so it does this. The interrupt handler is also changed to use a regular unbundlepart. So, by default, all consumers except bundlerepo will see unseekable parts. Because the behavior of the iterparts() benchmark changed, we add a variation to test seekable parts vs unseekable parts. And because parts no longer have seek() unless "seekable=True," we update the "part seek" benchmark. Speaking of benchmarks, this change has the following impact to `hg perfbundleread` on an uncompressed bundle of the Firefox repo (6,070,036,163 bytes): ! read(8k) ! wall 0.722709 comb 0.720000 user 0.150000 sys 0.570000 (best of 14) ! read(16k) ! wall 0.602208 comb 0.590000 user 0.080000 sys 0.510000 (best of 17) ! read(32k) ! wall 0.554018 comb 0.560000 user 0.050000 sys 0.510000 (best of 18) ! read(128k) ! wall 0.520086 comb 0.530000 user 0.020000 sys 0.510000 (best of 20) ! bundle2 forwardchunks() ! wall 2.996329 comb 3.000000 user 2.300000 sys 0.700000 (best of 4) ! bundle2 iterparts() ! wall 8.070791 comb 8.060000 user 7.180000 sys 0.880000 (best of 3) ! wall 6.983756 comb 6.980000 user 6.220000 sys 0.760000 (best of 3) ! bundle2 iterparts() seekable ! wall 8.132131 comb 8.110000 user 7.160000 sys 0.950000 (best of 3) ! bundle2 part seek() ! wall 10.370142 comb 10.350000 user 7.430000 sys 2.920000 (best of 3) ! wall 10.860942 comb 10.840000 user 7.790000 sys 3.050000 (best of 3) ! bundle2 part read(8k) ! wall 8.599892 comb 8.580000 user 7.720000 sys 0.860000 (best of 3) ! wall 7.258035 comb 7.260000 user 6.470000 sys 0.790000 (best of 3) ! bundle2 part read(16k) ! wall 8.265361 comb 8.250000 user 7.360000 sys 0.890000 (best of 3) ! wall 7.099891 comb 7.080000 user 6.310000 sys 0.770000 (best of 3) ! bundle2 part read(32k) ! wall 8.290308 comb 8.280000 user 7.330000 sys 0.950000 (best of 3) ! wall 6.964685 comb 6.950000 user 6.130000 sys 0.820000 (best of 3) ! bundle2 part read(128k) ! wall 8.204900 comb 8.150000 user 7.210000 sys 0.940000 (best of 3) ! wall 6.852867 comb 6.850000 user 6.060000 sys 0.790000 (best of 3) The significant speedup is due to not incurring the overhead to track payload offset data. Of course, this overhead is proportional to bundle2 part size. So a multiple gigabyte changegroup part is on the extreme side of the spectrum for real-world impact. In addition to the CPU efficiency wins, not tracking offset data also means not using memory to hold that data. Using a bundle based on the example BSD repository in issue 5691, this change has a drastic impact to memory usage during `hg unbundle` (`hg clone` would behave similarly). Before, memory usage incrementally increased for the duration of bundle processing. In other words, as we advanced through the changegroup and bundle2 part, we kept allocating more memory to hold offset data. After this change, we still increase memory during changegroup application. But the rate of increase is significantly slower. (A bulk of the remaining gradual increase appears to be the storing of revlog sizes in the transaction object to facilitate rollback.) The RSS at the end of filelog application is as follows: Before: ~752 MB After: ~567 MB So, we were storing ~185 MB of offset data that we never even used. Talk about wasteful! .. api:: bundle2 parts are no longer seekable by default. .. perf:: bundle2 read I/O throughput significantly increased. .. perf:: Significant memory use reductions when reading from bundle2 bundles. On the BSD repository, peak RSS during changegroup application decreased by ~185 MB from ~752 MB to ~567 MB. Differential Revision: https://phab.mercurial-scm.org/D1390
(0) -30000 -10000 -3000 -1000 -300 -100 -30 -10 -4 +4 +10 +30 +100 +300 +1000 +3000 +10000 tip