changeset 29015:87d4a6c5567e stable

bdiff: further restrain potential quadratic performance This causes the longest_match search to limit itself to a window of 30000 lines during search (roughly 1MB), thus avoiding a full O(N*M) search that might occur in repetitive structured inputs. For a particular class of many MB pathological test cases, this generated the following timings: size before after 10x 1.25s 1.24s 100x 57s 33s 1000x >8400s 400s The times on the right quickly become much faster and appear more linear. While windowing means deltas are no longer "optimal", the resulting deltas were within a couple percent of expected size. While we've yet to have a report of a file with the level of repetition necessary to hit this case, some JSON/XML database dump scenario is fairly likely to hit it. This may also slightly improve the average-case performance for deltas of large binaries.
author Matt Mackall <mpm@selenic.com>
date Fri, 22 Apr 2016 13:38:02 -0500
parents f1ca249696ed
children 94451300f3ec
files mercurial/bdiff.c
diffstat 1 files changed, 9 insertions(+), 1 deletions(-) [+]
line wrap: on
line diff
--- a/mercurial/bdiff.c	Thu Apr 21 22:04:11 2016 -0500
+++ b/mercurial/bdiff.c	Fri Apr 22 13:38:02 2016 -0500
@@ -148,7 +148,15 @@
 static int longest_match(struct line *a, struct line *b, struct pos *pos,
 			 int a1, int a2, int b1, int b2, int *omi, int *omj)
 {
-	int mi = a1, mj = b1, mk = 0, mb = 0, i, j, k, half = (a1 + a2) / 2;
+	int mi = a1, mj = b1, mk = 0, mb = 0, i, j, k, half;
+
+	/* window our search on large regions to better bound
+	   worst-case performance. by choosing a window at the end, we
+	   reduce skipping overhead on the b chains. */
+	if (a2 - a1 > 30000)
+		a1 = a2 - 30000;
+
+	half = (a1 + a2) / 2;
 
 	for (i = a1; i < a2; i++) {
 		/* skip all lines in b after the current block */