i18n/posplit
author Gregory Szorc <gregory.szorc@gmail.com>
Mon, 13 Nov 2017 19:20:34 -0800
changeset 35109 e96613048bdd
parent 29153 90d84e1e427a
child 39269 d0e8933d6dad
permissions -rwxr-xr-x
perf: add command to benchmark bundle reading Upcoming commits will be refactoring bundle2 I/O code. This commit establishes a `hg perfbundleread` command that measures how long it takes to read a bundle using various mechanisms. As a baseline, here's output from an uncompressed bundle1 bundle of my Firefox repo (7,098,622,890 bytes): ! read(8k) ! wall 0.763481 comb 0.760000 user 0.160000 sys 0.600000 (best of 6) ! read(16k) ! wall 0.644512 comb 0.640000 user 0.110000 sys 0.530000 (best of 16) ! read(32k) ! wall 0.581172 comb 0.590000 user 0.060000 sys 0.530000 (best of 18) ! read(128k) ! wall 0.535183 comb 0.530000 user 0.010000 sys 0.520000 (best of 19) ! cg1 deltaiter() ! wall 0.873500 comb 0.880000 user 0.840000 sys 0.040000 (best of 12) ! cg1 getchunks() ! wall 6.283797 comb 6.270000 user 5.570000 sys 0.700000 (best of 3) ! cg1 read(8k) ! wall 1.097173 comb 1.100000 user 0.400000 sys 0.700000 (best of 10) ! cg1 read(16k) ! wall 0.810750 comb 0.800000 user 0.200000 sys 0.600000 (best of 13) ! cg1 read(32k) ! wall 0.671215 comb 0.670000 user 0.110000 sys 0.560000 (best of 15) ! cg1 read(128k) ! wall 0.597857 comb 0.600000 user 0.020000 sys 0.580000 (best of 15) And from an uncompressed bundle2 bundle (6,070,036,163 bytes): ! read(8k) ! wall 0.676997 comb 0.680000 user 0.160000 sys 0.520000 (best of 15) ! read(16k) ! wall 0.592706 comb 0.590000 user 0.080000 sys 0.510000 (best of 17) ! read(32k) ! wall 0.529395 comb 0.530000 user 0.050000 sys 0.480000 (best of 16) ! read(128k) ! wall 0.491270 comb 0.490000 user 0.010000 sys 0.480000 (best of 19) ! bundle2 forwardchunks() ! wall 2.997131 comb 2.990000 user 2.270000 sys 0.720000 (best of 4) ! bundle2 iterparts() ! wall 12.247197 comb 10.670000 user 8.170000 sys 2.500000 (best of 3) ! bundle2 part seek() ! wall 11.761675 comb 10.500000 user 8.240000 sys 2.260000 (best of 3) ! bundle2 part read(8k) ! wall 9.116163 comb 9.110000 user 8.240000 sys 0.870000 (best of 3) ! bundle2 part read(16k) ! wall 8.984362 comb 8.970000 user 8.110000 sys 0.860000 (best of 3) ! bundle2 part read(32k) ! wall 8.758364 comb 8.740000 user 7.860000 sys 0.880000 (best of 3) ! bundle2 part read(128k) ! wall 8.749040 comb 8.730000 user 7.830000 sys 0.900000 (best of 3) We already see some interesting data. Notably that bundle2 has significant overhead compared to bundle1. This matters for e.g. stream clone bundles, which can be applied at >1Gbps. Differential Revision: https://phab.mercurial-scm.org/D1385
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
11389
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
     1
#!/usr/bin/env python
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
     2
#
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
     3
# posplit - split messages in paragraphs on .po/.pot files
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
     4
#
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
     5
# license: MIT/X11/Expat
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
     6
#
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
     7
29153
90d84e1e427a py3: make i18n/posplit use print_function
Pulkit Goyal <7895pulkit@gmail.com>
parents: 29152
diff changeset
     8
from __future__ import absolute_import, print_function
29152
c5057b7780dc py3: make i18n/posplit use absolute_import
Pulkit Goyal <7895pulkit@gmail.com>
parents: 28074
diff changeset
     9
c5057b7780dc py3: make i18n/posplit use absolute_import
Pulkit Goyal <7895pulkit@gmail.com>
parents: 28074
diff changeset
    10
import polib
20359
ff6ab0b2ebf7 i18n: posplit writes a warning for translators before rst directives
Simon Heimberg <simohe@besonet.ch>
parents: 11389
diff changeset
    11
import re
11389
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    12
import sys
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    13
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    14
def addentry(po, entry, cache):
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    15
    e = cache.get(entry.msgid)
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    16
    if e:
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    17
        e.occurrences.extend(entry.occurrences)
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    18
    else:
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    19
        po.append(entry)
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    20
        cache[entry.msgid] = entry
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    21
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    22
def mkentry(orig, delta, msgid, msgstr):
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    23
    entry = polib.POEntry()
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    24
    entry.merge(orig)
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    25
    entry.msgid = msgid or orig.msgid
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    26
    entry.msgstr = msgstr or orig.msgstr
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    27
    entry.occurrences = [(p, int(l) + delta) for (p, l) in orig.occurrences]
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    28
    return entry
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    29
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    30
if __name__ == "__main__":
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    31
    po = polib.pofile(sys.argv[1])
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    32
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    33
    cache = {}
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    34
    entries = po[:]
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    35
    po[:] = []
20359
ff6ab0b2ebf7 i18n: posplit writes a warning for translators before rst directives
Simon Heimberg <simohe@besonet.ch>
parents: 11389
diff changeset
    36
    findd = re.compile(r' *\.\. (\w+)::') # for finding directives
11389
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    37
    for entry in entries:
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    38
        msgids = entry.msgid.split(u'\n\n')
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    39
        if entry.msgstr:
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    40
            msgstrs = entry.msgstr.split(u'\n\n')
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    41
        else:
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    42
            msgstrs = [u''] * len(msgids)
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    43
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    44
        if len(msgids) != len(msgstrs):
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    45
            # places the whole existing translation as a fuzzy
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    46
            # translation for each paragraph, to give the
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    47
            # translator a chance to recover part of the old
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    48
            # translation - erasing extra paragraphs is
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    49
            # probably better than retranslating all from start
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    50
            if 'fuzzy' not in entry.flags:
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    51
                entry.flags.append('fuzzy')
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    52
            msgstrs = [entry.msgstr] * len(msgids)
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    53
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    54
        delta = 0
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    55
        for msgid, msgstr in zip(msgids, msgstrs):
20361
3fe079d3a2b4 i18n: posplit removes the entry "::" from the pot file
Simon Heimberg <simohe@besonet.ch>
parents: 20359
diff changeset
    56
            if msgid and msgid != '::':
11389
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    57
                newentry = mkentry(entry, delta, msgid, msgstr)
20359
ff6ab0b2ebf7 i18n: posplit writes a warning for translators before rst directives
Simon Heimberg <simohe@besonet.ch>
parents: 11389
diff changeset
    58
                mdirective = findd.match(msgid)
ff6ab0b2ebf7 i18n: posplit writes a warning for translators before rst directives
Simon Heimberg <simohe@besonet.ch>
parents: 11389
diff changeset
    59
                if mdirective:
20362
1bce1078501d i18n: leave out entries which contain only a rst directive
Simon Heimberg <simohe@besonet.ch>
parents: 20361
diff changeset
    60
                    if not msgid[mdirective.end():].rstrip():
1bce1078501d i18n: leave out entries which contain only a rst directive
Simon Heimberg <simohe@besonet.ch>
parents: 20361
diff changeset
    61
                        # only directive, nothing to translate here
28074
a1924bc6e267 i18n: calculate correct line number in source of messages to be translated
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 20363
diff changeset
    62
                        delta += 2
20362
1bce1078501d i18n: leave out entries which contain only a rst directive
Simon Heimberg <simohe@besonet.ch>
parents: 20361
diff changeset
    63
                        continue
20359
ff6ab0b2ebf7 i18n: posplit writes a warning for translators before rst directives
Simon Heimberg <simohe@besonet.ch>
parents: 11389
diff changeset
    64
                    directive = mdirective.group(1)
20363
e3ee7ec85a15 i18n: leave out entries which contain only rst syntax
Simon Heimberg <simohe@besonet.ch>
parents: 20362
diff changeset
    65
                    if directive in ('container', 'include'):
e3ee7ec85a15 i18n: leave out entries which contain only rst syntax
Simon Heimberg <simohe@besonet.ch>
parents: 20362
diff changeset
    66
                        if msgid.rstrip('\n').count('\n') == 0:
e3ee7ec85a15 i18n: leave out entries which contain only rst syntax
Simon Heimberg <simohe@besonet.ch>
parents: 20362
diff changeset
    67
                            # only rst syntax, nothing to translate
28074
a1924bc6e267 i18n: calculate correct line number in source of messages to be translated
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 20363
diff changeset
    68
                            delta += 2
20363
e3ee7ec85a15 i18n: leave out entries which contain only rst syntax
Simon Heimberg <simohe@besonet.ch>
parents: 20362
diff changeset
    69
                            continue
e3ee7ec85a15 i18n: leave out entries which contain only rst syntax
Simon Heimberg <simohe@besonet.ch>
parents: 20362
diff changeset
    70
                        else:
e3ee7ec85a15 i18n: leave out entries which contain only rst syntax
Simon Heimberg <simohe@besonet.ch>
parents: 20362
diff changeset
    71
                            # lines following directly, unexpected
29153
90d84e1e427a py3: make i18n/posplit use print_function
Pulkit Goyal <7895pulkit@gmail.com>
parents: 29152
diff changeset
    72
                            print('Warning: text follows line with directive' \
90d84e1e427a py3: make i18n/posplit use print_function
Pulkit Goyal <7895pulkit@gmail.com>
parents: 29152
diff changeset
    73
                                  ' %s' % directive)
20359
ff6ab0b2ebf7 i18n: posplit writes a warning for translators before rst directives
Simon Heimberg <simohe@besonet.ch>
parents: 11389
diff changeset
    74
                    comment = 'do not translate: .. %s::' % directive
ff6ab0b2ebf7 i18n: posplit writes a warning for translators before rst directives
Simon Heimberg <simohe@besonet.ch>
parents: 11389
diff changeset
    75
                    if not newentry.comment:
ff6ab0b2ebf7 i18n: posplit writes a warning for translators before rst directives
Simon Heimberg <simohe@besonet.ch>
parents: 11389
diff changeset
    76
                        newentry.comment = comment
ff6ab0b2ebf7 i18n: posplit writes a warning for translators before rst directives
Simon Heimberg <simohe@besonet.ch>
parents: 11389
diff changeset
    77
                    elif comment not in newentry.comment:
ff6ab0b2ebf7 i18n: posplit writes a warning for translators before rst directives
Simon Heimberg <simohe@besonet.ch>
parents: 11389
diff changeset
    78
                        newentry.comment += '\n' + comment
11389
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    79
                addentry(po, newentry, cache)
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    80
            delta += 2 + msgid.count('\n')
4fd49329a1b5 i18n: script for splitting large messages on .po/.pot files
Wagner Bruna <wbruna@yahoo.com>
parents:
diff changeset
    81
    po.save()