deltas: set estimated compression upper bound to "3x" instead of "10x"
In practice, we very rarely observe compression better than "3x" on manifest
deltas. A more aggressive estimate significantly helps our pathological
use case on a private repository. Here is a comparison of timings using
different upper bounds.
Estimated compression | ø | ×10 | ×5 | ×3 |
timing | 14.11 | 2.61 | 1.96 | 1.53 |
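To illustrate why a tighter bound saves time, here is a minimal sketch of the
idea. It is not the actual Mercurial code: the function name, its parameters
and the size limit are hypothetical. The assumption is that a delta candidate
can be rejected before any compression is attempted when even its best-case
compressed size would exceed the allowed limit.

```python
def worth_trying_delta(delta_len, max_compressed_len, upper_bound=3):
    """Return False when even best-case compression cannot fit the limit.

    All names here are hypothetical and only illustrate the idea:
    - delta_len: size of the uncompressed delta
    - max_compressed_len: largest compressed size we would accept
    - upper_bound: assumed best achievable compression ratio
    """
    if upper_bound is None or upper_bound <= 0:
        # No estimate configured: never skip a candidate.
        return True
    # Smallest size the delta could realistically compress to.
    best_case_len = delta_len // upper_bound
    return best_case_len <= max_compressed_len


# With a "10x" estimate this candidate is still compressed and tested;
# with a "3x" estimate it is discarded without doing the work.
print(worth_trying_delta(9000, 1000, upper_bound=10))
print(worth_trying_delta(9000, 1000, upper_bound=3))
```

A lower estimate rejects more candidates up front, which is where the time
saving would come from under this assumption.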
We also tested the impact of this series on an array of public repositories.
This showed no significant impact on either size or timing.
The full data set is below for those interested.
Size
----
Regarding size, no significant impact has been noticed on either public or
private repositories. Here are the numbers we gathered on public repositories:
zlib/upperbound | no | 10x | 5x | 3x
mercurial | 5 875 730 | 5 875 730 | 5 875 730 | 5 875 730
pypy | 27 782 913 | 27 782 913 | 27 782 913 | 27 782 913
netbeans | 159 161 207 | 159 161 207 | 159 161 207 | 159 959 879 (+0.5%)
mozilla-central | 323 841 642 | 323 841 642 | 323 841 642 | 319 867 519 (-2.5%)
mozilla-try | 746 649 123 | 746 649 123 | 746 649 123 | 741 155 568 (-0.7%)
private-repo | 1 485 287 294 | 1 485 287 294 | 1 485 287 294 | 1 409 248 382 (-5.1%)
zstd/upperbound | no | 10x | 5x | 3x
mercurial | 5 895 206 | 5 895 206 | 5 895 206 | 5 895 206
pypy | 28 689 230 | 28 689 230 | 28 689 230 | 28 689 230
netbeans | 157 636 387 | 157 636 387 | 157 636 387 | 159 692 678 (+1.3%)
mozilla-central | 317 650 281 | 317 650 281 | 317 650 281 | 319 613 603 (+0.6%)
mozilla-try | 737 555 275 | 737 555 275 | 737 555 275 | 738 079 473 (+0.1%)
private-repo | 1 352 362 982 | 1 352 362 982 | 1 346 961 880 | 1 361 327 384 (+0.7%)
Speed
------
Timings gathered using `hg perfrevlogwrite -m`. Values are in seconds.
mercurial
zlib | no | 10x | 5x | 3x |
total | 65.551783 | 65.388887 | 65.260658 | 65.321199 |
max | 0.034544 | 0.034571 | 0.034659 | 0.034521 |
99.99% | 0.034544 | 0.034571 | 0.034659 | 0.034521 |
zstd | no | 10x | 5x | 3x |
total | 49.118449 | 49.054062 | 48.753588 | 48.740230 |
max | 0.009338 | 0.009239 | 0.009202 | 0.009178 |
99.99% | 0.007618 | 0.007639 | 0.007626 | 0.007621 |
pypy
zlib | no | 10x | 5x | 3x |
total | 560.865984 | 558.983817 | 559.083815 | 559.349152 |
max | 0.219614 | 0.215922 | 0.218112 | 0.218107 |
99.99% | 0.219614 | 0.215922 | 0.218112 | 0.218107 |
zstd | no | 10x | 5x | 3x |
total | 349.393280 | 347.395819 | 347.185407 | 345.643985 |
max | 0.084143 | 0.083536 | 0.081834 | 0.082178 |
99.99% | 0.039445 | 0.039639 | 0.039612 | 0.039175 |
netbeans
zlib | no | 10x | 5x | 3x |
total | 33103.327727 | 33314.932260 | 33211.745233 | 33345.891778 |
max | 2.666852 | 2.672059 | 2.662453 | 2.662936 |
99.99% | 2.058772 | 2.070429 | 2.069569 | 2.064653 |
zstd | no | 10x | 5x | 3x |
total | 20112.102708 | 20095.879719 | 20083.390300 | 20123.221859 |
max | 2.063482 | 2.062851 | 2.065229 | 2.060147 |
99.99% | 1.146647 | 1.143794 | 1.142933 | 1.146529 |
mozilla
zlib | no | 10x | 5x | 3x |
total | 41374.102138 | 41418.816773 | 41381.956370 | 41334.280732 |
max | 3.383474 | 3.387400 | 3.405711 | 3.387316 |
99.99% | 1.006755 | 1.005954 | 1.007700 | 1.007373 |
zstd | no | 10x | 5x | 3x |
total | 24689.691520 | 24643.939662 | 24664.630027 | 24664.512714 |
max | 1.460822 | 1.449640 | 1.439747 | 1.465304 |
99.99% | 0.527111 | 0.527377 | 0.527807 | 0.527226 |
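For readers who want to compute the same kind of summary (total, max, 99.99%)
from their own per-revision timings, here is a small sketch. It uses the
nearest-rank percentile method, which is an assumption on our side and not
necessarily what the perf extension does; the function name is hypothetical.

```python
def summarize(timings, per_ten_thousand=9999):
    """Summarize a list of timings (hypothetical helper).

    per_ten_thousand=9999 selects the 99.99th percentile. Integer
    arithmetic is used to avoid floating point rounding on the rank.
    """
    if not timings:
        raise ValueError('need at least one timing')
    ordered = sorted(timings)
    # Nearest-rank method: ceil(n * p), computed with integers.
    rank = (len(ordered) * per_ten_thousand + 9999) // 10000
    return {
        'total': sum(ordered),
        'max': ordered[-1],
        'percentile': ordered[rank - 1],
    }


stats = summarize(list(range(1, 20001)))
print(stats['total'], stats['max'], stats['percentile'])
```

With this method, the percentile equals the maximum whenever there are fewer
than 10000 samples.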
#!/usr/bin/env python
#
# posplit - split messages in paragraphs on .po/.pot files
#
# license: MIT/X11/Expat
#
from __future__ import absolute_import, print_function
import polib
import re
import sys

def addentry(po, entry, cache):
    e = cache.get(entry.msgid)
    if e:
        e.occurrences.extend(entry.occurrences)

        # merge comments from entry
        for comment in entry.comment.split('\n'):
            if comment and comment not in e.comment:
                if not e.comment:
                    e.comment = comment
                else:
                    e.comment += '\n' + comment
    else:
        po.append(entry)
        cache[entry.msgid] = entry

def mkentry(orig, delta, msgid, msgstr):
    entry = polib.POEntry()
    entry.merge(orig)
    entry.msgid = msgid or orig.msgid
    entry.msgstr = msgstr or orig.msgstr
    entry.occurrences = [(p, int(l) + delta) for (p, l) in orig.occurrences]
    return entry

if __name__ == "__main__":
    po = polib.pofile(sys.argv[1])

    cache = {}
    entries = po[:]
    po[:] = []
    findd = re.compile(r' *\.\. (\w+)::') # for finding directives
    for entry in entries:
        msgids = entry.msgid.split(u'\n\n')
        if entry.msgstr:
            msgstrs = entry.msgstr.split(u'\n\n')
        else:
            msgstrs = [u''] * len(msgids)

        if len(msgids) != len(msgstrs):
            # places the whole existing translation as a fuzzy
            # translation for each paragraph, to give the
            # translator a chance to recover part of the old
            # translation - erasing extra paragraphs is
            # probably better than retranslating all from start
            if 'fuzzy' not in entry.flags:
                entry.flags.append('fuzzy')
            msgstrs = [entry.msgstr] * len(msgids)

        delta = 0
        for msgid, msgstr in zip(msgids, msgstrs):
            if msgid and msgid != '::':
                newentry = mkentry(entry, delta, msgid, msgstr)
                mdirective = findd.match(msgid)
                if mdirective:
                    if not msgid[mdirective.end():].rstrip():
                        # only directive, nothing to translate here
                        delta += 2
                        continue
                    directive = mdirective.group(1)
                    if directive in ('container', 'include'):
                        if msgid.rstrip('\n').count('\n') == 0:
                            # only rst syntax, nothing to translate
                            delta += 2
                            continue
                        else:
                            # lines following directly, unexpected
                            print('Warning: text follows line with directive'
                                  ' %s' % directive)
                    comment = 'do not translate: .. %s::' % directive
                    if not newentry.comment:
                        newentry.comment = comment
                    elif comment not in newentry.comment:
                        newentry.comment += '\n' + comment
                addentry(po, newentry, cache)
            delta += 2 + msgid.count('\n')
    po.save()
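
The line-offset bookkeeping in the loop above can be tried without polib. The
following is a simplified sketch, not part of posplit: `paragraph_offsets` is
a hypothetical helper that keeps only the paragraph splitting and the `delta`
computation, and leaves out the directive handling and the .po entry merging.

```python
def paragraph_offsets(text):
    """Return (line_offset, paragraph) pairs for a multi-paragraph message.

    Hypothetical helper mirroring the delta bookkeeping of the script:
    paragraphs are separated by a blank line, and empty paragraphs or a
    lone '::' marker are skipped but still advance the offset.
    """
    result = []
    delta = 0
    for para in text.split('\n\n'):
        if para and para != '::':
            result.append((delta, para))
        # The paragraph's own newlines plus the two newlines of the
        # blank-line separator that follows it.
        delta += 2 + para.count('\n')
    return result


message = 'first line\nsecond line\n\n::\n\nthird'
for offset, para in paragraph_offsets(message):
    print(offset, repr(para))
```

Each offset is the number of lines between the start of the original message
and the start of the paragraph, which is what the script adds to the
occurrence line numbers.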