Mercurial: mercurial/encoding.py annotate

annotate mercurial/encoding.py @ 22778:80f2b63dd83a

parsers: add a function to efficiently lowercase ASCII strings

We need a way to efficiently lowercase ASCII strings. For example, 'hg status' needs to build up the fold map -- a map from a canonical case (for OS X, lowercase) to the actual case of each file and directory in the dirstate.

The current way we do that is to try decoding to ASCII and then calling lower() on the string, labeled 'orig' below:

    str.decode('ascii')
    return str.lower()

This is pretty inefficient, and it turns out we can do much better.

I also tested out a condition-based approach, labeled 'cond' below:

    (c >= 'A' && c <= 'Z') ? (c + ('a' - 'A')) : c

'cond' turned out to be slower in all cases. A 256-byte lookup table with invalid values for everything past 127 performed similarly, but this was less verbose.

On OS X 10.9 with LLVM version 6.0 (clang-600.0.51), the asciilower function was run against two corpuses.

Corpus 1 (list of files from real-world repo, > 100k files):
    orig:   wall 0.428567 comb 0.430000 user 0.430000 sys 0.000000 (best of 24)
    cond:   wall 0.077204 comb 0.070000 user 0.070000 sys 0.000000 (best of 100)
    lookup: wall 0.060714 comb 0.060000 user 0.060000 sys 0.000000 (best of 100)

Corpus 2 (mozilla-central, 113k files):
    orig:   wall 0.238406 comb 0.240000 user 0.240000 sys 0.000000 (best of 42)
    cond:   wall 0.040779 comb 0.040000 user 0.040000 sys 0.000000 (best of 100)
    lookup: wall 0.037623 comb 0.040000 user 0.040000 sys 0.000000 (best of 100)

On a Linux server-class machine with GCC 4.4.6 20120305 (Red Hat 4.4.6-4):

Corpus 1 (real-world repo, > 100k files):
    orig:   wall 0.260899 comb 0.260000 user 0.260000 sys 0.000000 (best of 38)
    cond:   wall 0.054818 comb 0.060000 user 0.060000 sys 0.000000 (best of 100)
    lookup: wall 0.048489 comb 0.050000 user 0.050000 sys 0.000000 (best of 100)

Corpus 2 (mozilla-central, 113k files):
    orig:   wall 0.153082 comb 0.150000 user 0.150000 sys 0.000000 (best of 65)
    cond:   wall 0.031007 comb 0.040000 user 0.040000 sys 0.000000 (best of 100)
    lookup: wall 0.028793 comb 0.030000 user 0.030000 sys 0.000000 (best of 100)

SSE instructions might help even more, but I didn't experiment with those.

author	Siddharth Agarwal <sid0@fb.com>
date	Fri, 03 Oct 2014 18:42:39 -0700
parents	f6b533e64ed6
children	d9585dda63c3

rev	line source
8226 8b2cd04a6e97 put license and copyright info into comment blocks Martin Geisler <mg@lazybytes.net> parents: 8225 diff changeset	1 # encoding.py - character transcoding support for Mercurial
8b2cd04a6e97 put license and copyright info into comment blocks Martin Geisler <mg@lazybytes.net> parents: 8225 diff changeset	2 #
8b2cd04a6e97 put license and copyright info into comment blocks Martin Geisler <mg@lazybytes.net> parents: 8225 diff changeset	3 # Copyright 2005-2009 Matt Mackall <mpm@selenic.com> and others
8b2cd04a6e97 put license and copyright info into comment blocks Martin Geisler <mg@lazybytes.net> parents: 8225 diff changeset	4 #
8b2cd04a6e97 put license and copyright info into comment blocks Martin Geisler <mg@lazybytes.net> parents: 8225 diff changeset	5 # This software may be used and distributed according to the terms of the
10263 25e572394f5c Update license to GPLv2+ Matt Mackall <mpm@selenic.com> parents: 9574 diff changeset	6 # GNU General Public License version 2 or any later version.
7948 de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	7
22778 80f2b63dd83a parsers: add a function to efficiently lowercase ASCII strings Siddharth Agarwal <sid0@fb.com> parents: 22426 diff changeset	8 import error, parsers
12062 c327bfa5e831 cleanup: remove unused imports Brodie Rao <brodie@bitheap.org> parents: 11892 diff changeset	9 import unicodedata, locale, os
7948 de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	10
11892 2be70ca17311 encoding: improve handling of buggy getpreferredencoding() on Mac OS X Dan Villiom Podlaski Christiansen <danchr@gmail.com> parents: 11297 diff changeset	11 def _getpreferredencoding():
2be70ca17311 encoding: improve handling of buggy getpreferredencoding() on Mac OS X Dan Villiom Podlaski Christiansen <danchr@gmail.com> parents: 11297 diff changeset	12 '''
2be70ca17311 encoding: improve handling of buggy getpreferredencoding() on Mac OS X Dan Villiom Podlaski Christiansen <danchr@gmail.com> parents: 11297 diff changeset	13 On darwin, getpreferredencoding ignores the locale environment and
2be70ca17311 encoding: improve handling of buggy getpreferredencoding() on Mac OS X Dan Villiom Podlaski Christiansen <danchr@gmail.com> parents: 11297 diff changeset	14 always returns mac-roman. http://bugs.python.org/issue6202 fixes this
2be70ca17311 encoding: improve handling of buggy getpreferredencoding() on Mac OS X Dan Villiom Podlaski Christiansen <danchr@gmail.com> parents: 11297 diff changeset	15 for Python 2.7 and up. This is the same corrected code for earlier
2be70ca17311 encoding: improve handling of buggy getpreferredencoding() on Mac OS X Dan Villiom Podlaski Christiansen <danchr@gmail.com> parents: 11297 diff changeset	16 Python versions.
2be70ca17311 encoding: improve handling of buggy getpreferredencoding() on Mac OS X Dan Villiom Podlaski Christiansen <danchr@gmail.com> parents: 11297 diff changeset	17
12770 614f0d8724ab check-code: find trailing whitespace Martin Geisler <mg@lazybytes.net> parents: 12062 diff changeset	18 However, we can't use a version check for this method, as some distributions
11892 2be70ca17311 encoding: improve handling of buggy getpreferredencoding() on Mac OS X Dan Villiom Podlaski Christiansen <danchr@gmail.com> parents: 11297 diff changeset	19 patch Python to fix this. Instead, we use it as a 'fixer' for the mac-roman
2be70ca17311 encoding: improve handling of buggy getpreferredencoding() on Mac OS X Dan Villiom Podlaski Christiansen <danchr@gmail.com> parents: 11297 diff changeset	20 encoding, as it is unlikely that this encoding is the actually expected.
2be70ca17311 encoding: improve handling of buggy getpreferredencoding() on Mac OS X Dan Villiom Podlaski Christiansen <danchr@gmail.com> parents: 11297 diff changeset	21 '''
2be70ca17311 encoding: improve handling of buggy getpreferredencoding() on Mac OS X Dan Villiom Podlaski Christiansen <danchr@gmail.com> parents: 11297 diff changeset	22 try:
2be70ca17311 encoding: improve handling of buggy getpreferredencoding() on Mac OS X Dan Villiom Podlaski Christiansen <danchr@gmail.com> parents: 11297 diff changeset	23 locale.CODESET
2be70ca17311 encoding: improve handling of buggy getpreferredencoding() on Mac OS X Dan Villiom Podlaski Christiansen <danchr@gmail.com> parents: 11297 diff changeset	24 except AttributeError:
2be70ca17311 encoding: improve handling of buggy getpreferredencoding() on Mac OS X Dan Villiom Podlaski Christiansen <danchr@gmail.com> parents: 11297 diff changeset	25 # Fall back to parsing environment variables :-(
2be70ca17311 encoding: improve handling of buggy getpreferredencoding() on Mac OS X Dan Villiom Podlaski Christiansen <danchr@gmail.com> parents: 11297 diff changeset	26 return locale.getdefaultlocale()[1]
2be70ca17311 encoding: improve handling of buggy getpreferredencoding() on Mac OS X Dan Villiom Podlaski Christiansen <danchr@gmail.com> parents: 11297 diff changeset	27
2be70ca17311 encoding: improve handling of buggy getpreferredencoding() on Mac OS X Dan Villiom Podlaski Christiansen <danchr@gmail.com> parents: 11297 diff changeset	28 oldloc = locale.setlocale(locale.LC_CTYPE)
2be70ca17311 encoding: improve handling of buggy getpreferredencoding() on Mac OS X Dan Villiom Podlaski Christiansen <danchr@gmail.com> parents: 11297 diff changeset	29 locale.setlocale(locale.LC_CTYPE, "")
2be70ca17311 encoding: improve handling of buggy getpreferredencoding() on Mac OS X Dan Villiom Podlaski Christiansen <danchr@gmail.com> parents: 11297 diff changeset	30 result = locale.nl_langinfo(locale.CODESET)
2be70ca17311 encoding: improve handling of buggy getpreferredencoding() on Mac OS X Dan Villiom Podlaski Christiansen <danchr@gmail.com> parents: 11297 diff changeset	31 locale.setlocale(locale.LC_CTYPE, oldloc)
2be70ca17311 encoding: improve handling of buggy getpreferredencoding() on Mac OS X Dan Villiom Podlaski Christiansen <danchr@gmail.com> parents: 11297 diff changeset	32
2be70ca17311 encoding: improve handling of buggy getpreferredencoding() on Mac OS X Dan Villiom Podlaski Christiansen <danchr@gmail.com> parents: 11297 diff changeset	33 return result
2be70ca17311 encoding: improve handling of buggy getpreferredencoding() on Mac OS X Dan Villiom Podlaski Christiansen <danchr@gmail.com> parents: 11297 diff changeset	34
2be70ca17311 encoding: improve handling of buggy getpreferredencoding() on Mac OS X Dan Villiom Podlaski Christiansen <danchr@gmail.com> parents: 11297 diff changeset	35 _encodingfixers = {
2be70ca17311 encoding: improve handling of buggy getpreferredencoding() on Mac OS X Dan Villiom Podlaski Christiansen <danchr@gmail.com> parents: 11297 diff changeset	36 '646': lambda: 'ascii',
2be70ca17311 encoding: improve handling of buggy getpreferredencoding() on Mac OS X Dan Villiom Podlaski Christiansen <danchr@gmail.com> parents: 11297 diff changeset	37 'ANSI_X3.4-1968': lambda: 'ascii',
2be70ca17311 encoding: improve handling of buggy getpreferredencoding() on Mac OS X Dan Villiom Podlaski Christiansen <danchr@gmail.com> parents: 11297 diff changeset	38 'mac-roman': _getpreferredencoding
2be70ca17311 encoding: improve handling of buggy getpreferredencoding() on Mac OS X Dan Villiom Podlaski Christiansen <danchr@gmail.com> parents: 11297 diff changeset	39 }
7948 de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	40
de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	41 try:
de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	42 encoding = os.environ.get("HGENCODING")
de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	43 if not encoding:
de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	44 encoding = locale.getpreferredencoding() or 'ascii'
11892 2be70ca17311 encoding: improve handling of buggy getpreferredencoding() on Mac OS X Dan Villiom Podlaski Christiansen <danchr@gmail.com> parents: 11297 diff changeset	45 encoding = _encodingfixers.get(encoding, lambda: encoding)()
7948 de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	46 except locale.Error:
de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	47 encoding = 'ascii'
de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	48 encodingmode = os.environ.get("HGENCODINGMODE", "strict")
de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	49 fallbackencoding = 'ISO-8859-1'
de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	50
13046 7cc4263e07a9 encoding: add localstr class to track UTF-8 version of transcoded strings Matt Mackall <mpm@selenic.com> parents: 12866 diff changeset	51 class localstr(str):
7cc4263e07a9 encoding: add localstr class to track UTF-8 version of transcoded strings Matt Mackall <mpm@selenic.com> parents: 12866 diff changeset	52 '''This class allows strings that are unmodified to be
7cc4263e07a9 encoding: add localstr class to track UTF-8 version of transcoded strings Matt Mackall <mpm@selenic.com> parents: 12866 diff changeset	53 round-tripped to the local encoding and back'''
7cc4263e07a9 encoding: add localstr class to track UTF-8 version of transcoded strings Matt Mackall <mpm@selenic.com> parents: 12866 diff changeset	54 def __new__(cls, u, l):
7cc4263e07a9 encoding: add localstr class to track UTF-8 version of transcoded strings Matt Mackall <mpm@selenic.com> parents: 12866 diff changeset	55 s = str.__new__(cls, l)
7cc4263e07a9 encoding: add localstr class to track UTF-8 version of transcoded strings Matt Mackall <mpm@selenic.com> parents: 12866 diff changeset	56 s._utf8 = u
7cc4263e07a9 encoding: add localstr class to track UTF-8 version of transcoded strings Matt Mackall <mpm@selenic.com> parents: 12866 diff changeset	57 return s
7cc4263e07a9 encoding: add localstr class to track UTF-8 version of transcoded strings Matt Mackall <mpm@selenic.com> parents: 12866 diff changeset	58 def __hash__(self):
7cc4263e07a9 encoding: add localstr class to track UTF-8 version of transcoded strings Matt Mackall <mpm@selenic.com> parents: 12866 diff changeset	59 return hash(self._utf8) # avoid collisions in local string space
7cc4263e07a9 encoding: add localstr class to track UTF-8 version of transcoded strings Matt Mackall <mpm@selenic.com> parents: 12866 diff changeset	60
7948 de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	61 def tolocal(s):
de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	62 """
de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	63 Convert a string from internal UTF-8 to local encoding
de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	64
de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	65 All internal strings should be UTF-8 but some repos before the
de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	66 implementation of locale support may contain latin1 or possibly
de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	67 other character sets. We attempt to decode everything strictly
de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	68 using UTF-8, then Latin-1, and failing that, we use UTF-8 and
de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	69 replace unknown characters.
13046 7cc4263e07a9 encoding: add localstr class to track UTF-8 version of transcoded strings Matt Mackall <mpm@selenic.com> parents: 12866 diff changeset	70
7cc4263e07a9 encoding: add localstr class to track UTF-8 version of transcoded strings Matt Mackall <mpm@selenic.com> parents: 12866 diff changeset	71 The localstr class is used to cache the known UTF-8 encoding of
7cc4263e07a9 encoding: add localstr class to track UTF-8 version of transcoded strings Matt Mackall <mpm@selenic.com> parents: 12866 diff changeset	72 strings next to their local representation to allow lossless
7cc4263e07a9 encoding: add localstr class to track UTF-8 version of transcoded strings Matt Mackall <mpm@selenic.com> parents: 12866 diff changeset	73 round-trip conversion back to UTF-8.
7cc4263e07a9 encoding: add localstr class to track UTF-8 version of transcoded strings Matt Mackall <mpm@selenic.com> parents: 12866 diff changeset	74
7cc4263e07a9 encoding: add localstr class to track UTF-8 version of transcoded strings Matt Mackall <mpm@selenic.com> parents: 12866 diff changeset	75 >>> u = 'foo: \\xc3\\xa4' # utf-8
7cc4263e07a9 encoding: add localstr class to track UTF-8 version of transcoded strings Matt Mackall <mpm@selenic.com> parents: 12866 diff changeset	76 >>> l = tolocal(u)
7cc4263e07a9 encoding: add localstr class to track UTF-8 version of transcoded strings Matt Mackall <mpm@selenic.com> parents: 12866 diff changeset	77 >>> l
7cc4263e07a9 encoding: add localstr class to track UTF-8 version of transcoded strings Matt Mackall <mpm@selenic.com> parents: 12866 diff changeset	78 'foo: ?'
7cc4263e07a9 encoding: add localstr class to track UTF-8 version of transcoded strings Matt Mackall <mpm@selenic.com> parents: 12866 diff changeset	79 >>> fromlocal(l)
7cc4263e07a9 encoding: add localstr class to track UTF-8 version of transcoded strings Matt Mackall <mpm@selenic.com> parents: 12866 diff changeset	80 'foo: \\xc3\\xa4'
7cc4263e07a9 encoding: add localstr class to track UTF-8 version of transcoded strings Matt Mackall <mpm@selenic.com> parents: 12866 diff changeset	81 >>> u2 = 'foo: \\xc3\\xa1'
7cc4263e07a9 encoding: add localstr class to track UTF-8 version of transcoded strings Matt Mackall <mpm@selenic.com> parents: 12866 diff changeset	82 >>> d = { l: 1, tolocal(u2): 2 }
18378 404feac78b8a tests: stabilize doctest output Mads Kiilerich <mads@kiilerich.com> parents: 17424 diff changeset	83 >>> len(d) # no collision
404feac78b8a tests: stabilize doctest output Mads Kiilerich <mads@kiilerich.com> parents: 17424 diff changeset	84 2
13046 7cc4263e07a9 encoding: add localstr class to track UTF-8 version of transcoded strings Matt Mackall <mpm@selenic.com> parents: 12866 diff changeset	85 >>> 'foo: ?' in d
7cc4263e07a9 encoding: add localstr class to track UTF-8 version of transcoded strings Matt Mackall <mpm@selenic.com> parents: 12866 diff changeset	86 False
7cc4263e07a9 encoding: add localstr class to track UTF-8 version of transcoded strings Matt Mackall <mpm@selenic.com> parents: 12866 diff changeset	87 >>> l1 = 'foo: \\xe4' # historical latin1 fallback
7cc4263e07a9 encoding: add localstr class to track UTF-8 version of transcoded strings Matt Mackall <mpm@selenic.com> parents: 12866 diff changeset	88 >>> l = tolocal(l1)
7cc4263e07a9 encoding: add localstr class to track UTF-8 version of transcoded strings Matt Mackall <mpm@selenic.com> parents: 12866 diff changeset	89 >>> l
7cc4263e07a9 encoding: add localstr class to track UTF-8 version of transcoded strings Matt Mackall <mpm@selenic.com> parents: 12866 diff changeset	90 'foo: ?'
7cc4263e07a9 encoding: add localstr class to track UTF-8 version of transcoded strings Matt Mackall <mpm@selenic.com> parents: 12866 diff changeset	91 >>> fromlocal(l) # magically in utf-8
7cc4263e07a9 encoding: add localstr class to track UTF-8 version of transcoded strings Matt Mackall <mpm@selenic.com> parents: 12866 diff changeset	92 'foo: \\xc3\\xa4'
7948 de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	93 """
13046 7cc4263e07a9 encoding: add localstr class to track UTF-8 version of transcoded strings Matt Mackall <mpm@selenic.com> parents: 12866 diff changeset	94
16274 5d75eb8568d1 encoding: tune fast-path of tolocal a bit Matt Mackall <mpm@selenic.com> parents: 16133 diff changeset	95 try:
7948 de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	96 try:
16274 5d75eb8568d1 encoding: tune fast-path of tolocal a bit Matt Mackall <mpm@selenic.com> parents: 16133 diff changeset	97 # make sure string is actually stored in UTF-8
5d75eb8568d1 encoding: tune fast-path of tolocal a bit Matt Mackall <mpm@selenic.com> parents: 16133 diff changeset	98 u = s.decode('UTF-8')
5d75eb8568d1 encoding: tune fast-path of tolocal a bit Matt Mackall <mpm@selenic.com> parents: 16133 diff changeset	99 if encoding == 'UTF-8':
5d75eb8568d1 encoding: tune fast-path of tolocal a bit Matt Mackall <mpm@selenic.com> parents: 16133 diff changeset	100 # fast path
5d75eb8568d1 encoding: tune fast-path of tolocal a bit Matt Mackall <mpm@selenic.com> parents: 16133 diff changeset	101 return s
13940 b7b26e54e37a encoding: avoid localstr when a string can be encoded losslessly (issue2763) Matt Mackall <mpm@selenic.com> parents: 13051 diff changeset	102 r = u.encode(encoding, "replace")
b7b26e54e37a encoding: avoid localstr when a string can be encoded losslessly (issue2763) Matt Mackall <mpm@selenic.com> parents: 13051 diff changeset	103 if u == r.decode(encoding):
b7b26e54e37a encoding: avoid localstr when a string can be encoded losslessly (issue2763) Matt Mackall <mpm@selenic.com> parents: 13051 diff changeset	104 # r is a safe, non-lossy encoding of s
b7b26e54e37a encoding: avoid localstr when a string can be encoded losslessly (issue2763) Matt Mackall <mpm@selenic.com> parents: 13051 diff changeset	105 return r
16274 5d75eb8568d1 encoding: tune fast-path of tolocal a bit Matt Mackall <mpm@selenic.com> parents: 16133 diff changeset	106 return localstr(s, r)
5d75eb8568d1 encoding: tune fast-path of tolocal a bit Matt Mackall <mpm@selenic.com> parents: 16133 diff changeset	107 except UnicodeDecodeError:
5d75eb8568d1 encoding: tune fast-path of tolocal a bit Matt Mackall <mpm@selenic.com> parents: 16133 diff changeset	108 # we should only get here if we're looking at an ancient changeset
5d75eb8568d1 encoding: tune fast-path of tolocal a bit Matt Mackall <mpm@selenic.com> parents: 16133 diff changeset	109 try:
5d75eb8568d1 encoding: tune fast-path of tolocal a bit Matt Mackall <mpm@selenic.com> parents: 16133 diff changeset	110 u = s.decode(fallbackencoding)
5d75eb8568d1 encoding: tune fast-path of tolocal a bit Matt Mackall <mpm@selenic.com> parents: 16133 diff changeset	111 r = u.encode(encoding, "replace")
5d75eb8568d1 encoding: tune fast-path of tolocal a bit Matt Mackall <mpm@selenic.com> parents: 16133 diff changeset	112 if u == r.decode(encoding):
5d75eb8568d1 encoding: tune fast-path of tolocal a bit Matt Mackall <mpm@selenic.com> parents: 16133 diff changeset	113 # r is a safe, non-lossy encoding of s
5d75eb8568d1 encoding: tune fast-path of tolocal a bit Matt Mackall <mpm@selenic.com> parents: 16133 diff changeset	114 return r
13940 b7b26e54e37a encoding: avoid localstr when a string can be encoded losslessly (issue2763) Matt Mackall <mpm@selenic.com> parents: 13051 diff changeset	115 return localstr(u.encode('UTF-8'), r)
16274 5d75eb8568d1 encoding: tune fast-path of tolocal a bit Matt Mackall <mpm@selenic.com> parents: 16133 diff changeset	116 except UnicodeDecodeError:
5d75eb8568d1 encoding: tune fast-path of tolocal a bit Matt Mackall <mpm@selenic.com> parents: 16133 diff changeset	117 u = s.decode("utf-8", "replace") # last ditch
5d75eb8568d1 encoding: tune fast-path of tolocal a bit Matt Mackall <mpm@selenic.com> parents: 16133 diff changeset	118 return u.encode(encoding, "replace") # can't round-trip
5d75eb8568d1 encoding: tune fast-path of tolocal a bit Matt Mackall <mpm@selenic.com> parents: 16133 diff changeset	119 except LookupError, k:
5d75eb8568d1 encoding: tune fast-path of tolocal a bit Matt Mackall <mpm@selenic.com> parents: 16133 diff changeset	120 raise error.Abort(k, hint="please check your locale settings")
7948 de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	121
de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	122 def fromlocal(s):
de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	123 """
de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	124 Convert a string from the local character encoding to UTF-8
de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	125
de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	126 We attempt to decode strings using the encoding mode set by
de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	127 HGENCODINGMODE, which defaults to 'strict'. In this mode, unknown
de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	128 characters will cause an error message. Other modes include
de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	129 'replace', which replaces unknown characters with a special
de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	130 Unicode character, and 'ignore', which drops the character.
de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	131 """
13046 7cc4263e07a9 encoding: add localstr class to track UTF-8 version of transcoded strings Matt Mackall <mpm@selenic.com> parents: 12866 diff changeset	132
7cc4263e07a9 encoding: add localstr class to track UTF-8 version of transcoded strings Matt Mackall <mpm@selenic.com> parents: 12866 diff changeset	133 # can we do a lossless round-trip?
7cc4263e07a9 encoding: add localstr class to track UTF-8 version of transcoded strings Matt Mackall <mpm@selenic.com> parents: 12866 diff changeset	134 if isinstance(s, localstr):
7cc4263e07a9 encoding: add localstr class to track UTF-8 version of transcoded strings Matt Mackall <mpm@selenic.com> parents: 12866 diff changeset	135 return s._utf8
7cc4263e07a9 encoding: add localstr class to track UTF-8 version of transcoded strings Matt Mackall <mpm@selenic.com> parents: 12866 diff changeset	136
7948 de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	137 try:
de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	138 return s.decode(encoding, encodingmode).encode("utf-8")
de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	139 except UnicodeDecodeError, inst:
10282 08a0f04b56bd many, many trivial check-code fixups Matt Mackall <mpm@selenic.com> parents: 10263 diff changeset	140 sub = s[max(0, inst.start - 10):inst.start + 10]
7948 de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	141 raise error.Abort("decoding near '%s': %s!" % (sub, inst))
de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	142 except LookupError, k:
15769 afdf4f5bac61 encoding: use hint markup for "please check your locale settings" Mads Kiilerich <mads@kiilerich.com> parents: 15672 diff changeset	143 raise error.Abort(k, hint="please check your locale settings")
7948 de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	144
12866 eddc20306ab6 encoding: default ambiguous character to narrow Matt Mackall <mpm@selenic.com> parents: 12770 diff changeset	145 # How to treat ambiguous-width characters. Set to 'wide' to treat as wide.
15066 24efa83d81cb i18n: calculate terminal columns by width information of each characters FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 14951 diff changeset	146 wide = (os.environ.get("HGENCODINGAMBIGUOUS", "narrow") == "wide"
24efa83d81cb i18n: calculate terminal columns by width information of each characters FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 14951 diff changeset	147 and "WFA" or "WF")
12866 eddc20306ab6 encoding: default ambiguous character to narrow Matt Mackall <mpm@selenic.com> parents: 12770 diff changeset	148
7948 de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	149 def colwidth(s):
15142 176882876780 encoding: colwidth input is in the local encoding Matt Mackall <mpm@selenic.com> parents: 15066 diff changeset	150 "Find the column width of a string for display in the local encoding"
15066 24efa83d81cb i18n: calculate terminal columns by width information of each characters FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 14951 diff changeset	151 return ucolwidth(s.decode(encoding, 'replace'))
24efa83d81cb i18n: calculate terminal columns by width information of each characters FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 14951 diff changeset	152
24efa83d81cb i18n: calculate terminal columns by width information of each characters FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 14951 diff changeset	153 def ucolwidth(d):
24efa83d81cb i18n: calculate terminal columns by width information of each characters FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 14951 diff changeset	154 "Find the column width of a Unicode string for display"
14951 61807854004e encoding: use getattr isntead of hasattr Augie Fackler <durin42@gmail.com> parents: 14069 diff changeset	155 eaw = getattr(unicodedata, 'east_asian_width', None)
61807854004e encoding: use getattr isntead of hasattr Augie Fackler <durin42@gmail.com> parents: 14069 diff changeset	156 if eaw is not None:
61807854004e encoding: use getattr isntead of hasattr Augie Fackler <durin42@gmail.com> parents: 14069 diff changeset	157 return sum([eaw(c) in wide and 2 or 1 for c in d])
7948 de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	158 return len(d)
de377b1a9a84 move encoding bits from util to encoding Matt Mackall <mpm@selenic.com> parents: diff changeset	159
15143 16c129b0f465 encoding: add getcols to extract substrings based on column width Matt Mackall <mpm@selenic.com> parents: 15142 diff changeset	160 def getcols(s, start, c):
16c129b0f465 encoding: add getcols to extract substrings based on column width Matt Mackall <mpm@selenic.com> parents: 15142 diff changeset	161 '''Use colwidth to find a c-column substring of s starting at byte
16c129b0f465 encoding: add getcols to extract substrings based on column width Matt Mackall <mpm@selenic.com> parents: 15142 diff changeset	162 index start'''
16c129b0f465 encoding: add getcols to extract substrings based on column width Matt Mackall <mpm@selenic.com> parents: 15142 diff changeset	163 for x in xrange(start + c, len(s)):
16c129b0f465 encoding: add getcols to extract substrings based on column width Matt Mackall <mpm@selenic.com> parents: 15142 diff changeset	164 t = s[start:x]
16c129b0f465 encoding: add getcols to extract substrings based on column width Matt Mackall <mpm@selenic.com> parents: 15142 diff changeset	165 if colwidth(t) == c:
16c129b0f465 encoding: add getcols to extract substrings based on column width Matt Mackall <mpm@selenic.com> parents: 15142 diff changeset	166 return t
16c129b0f465 encoding: add getcols to extract substrings based on column width Matt Mackall <mpm@selenic.com> parents: 15142 diff changeset	167
21861 b515c3a63e96 encoding: add 'leftside' argument into 'trim' to switch trimming side FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 21856 diff changeset	168 def trim(s, width, ellipsis='', leftside=False):
21856 d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	169 """Trim string 's' to at most 'width' columns (including 'ellipsis').
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	170
21861 b515c3a63e96 encoding: add 'leftside' argument into 'trim' to switch trimming side FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 21856 diff changeset	171 If 'leftside' is True, left side of string 's' is trimmed.
b515c3a63e96 encoding: add 'leftside' argument into 'trim' to switch trimming side FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 21856 diff changeset	172 'ellipsis' is always placed at trimmed side.
b515c3a63e96 encoding: add 'leftside' argument into 'trim' to switch trimming side FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 21856 diff changeset	173
21856 d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	174 >>> ellipsis = '+++'
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	175 >>> from mercurial import encoding
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	176 >>> encoding.encoding = 'utf-8'
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	177 >>> t = '1234567890'
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	178 >>> print trim(t, 12, ellipsis=ellipsis)
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	179 1234567890
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	180 >>> print trim(t, 10, ellipsis=ellipsis)
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	181 1234567890
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	182 >>> print trim(t, 8, ellipsis=ellipsis)
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	183 12345+++
21861 b515c3a63e96 encoding: add 'leftside' argument into 'trim' to switch trimming side FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 21856 diff changeset	184 >>> print trim(t, 8, ellipsis=ellipsis, leftside=True)
b515c3a63e96 encoding: add 'leftside' argument into 'trim' to switch trimming side FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 21856 diff changeset	185 +++67890
21856 d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	186 >>> print trim(t, 8)
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	187 12345678
21861 b515c3a63e96 encoding: add 'leftside' argument into 'trim' to switch trimming side FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 21856 diff changeset	188 >>> print trim(t, 8, leftside=True)
b515c3a63e96 encoding: add 'leftside' argument into 'trim' to switch trimming side FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 21856 diff changeset	189 34567890
21856 d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	190 >>> print trim(t, 3, ellipsis=ellipsis)
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	191 +++
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	192 >>> print trim(t, 1, ellipsis=ellipsis)
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	193 +
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	194 >>> u = u'\u3042\u3044\u3046\u3048\u304a' # 2 x 5 = 10 columns
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	195 >>> t = u.encode(encoding.encoding)
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	196 >>> print trim(t, 12, ellipsis=ellipsis)
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	197 \xe3\x81\x82\xe3\x81\x84\xe3\x81\x86\xe3\x81\x88\xe3\x81\x8a
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	198 >>> print trim(t, 10, ellipsis=ellipsis)
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	199 \xe3\x81\x82\xe3\x81\x84\xe3\x81\x86\xe3\x81\x88\xe3\x81\x8a
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	200 >>> print trim(t, 8, ellipsis=ellipsis)
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	201 \xe3\x81\x82\xe3\x81\x84+++
21861 b515c3a63e96 encoding: add 'leftside' argument into 'trim' to switch trimming side FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 21856 diff changeset	202 >>> print trim(t, 8, ellipsis=ellipsis, leftside=True)
b515c3a63e96 encoding: add 'leftside' argument into 'trim' to switch trimming side FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 21856 diff changeset	203 +++\xe3\x81\x88\xe3\x81\x8a
21856 d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	204 >>> print trim(t, 5)
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	205 \xe3\x81\x82\xe3\x81\x84
21861 b515c3a63e96 encoding: add 'leftside' argument into 'trim' to switch trimming side FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 21856 diff changeset	206 >>> print trim(t, 5, leftside=True)
b515c3a63e96 encoding: add 'leftside' argument into 'trim' to switch trimming side FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 21856 diff changeset	207 \xe3\x81\x88\xe3\x81\x8a
21856 d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	208 >>> print trim(t, 4, ellipsis=ellipsis)
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	209 +++
21861 b515c3a63e96 encoding: add 'leftside' argument into 'trim' to switch trimming side FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 21856 diff changeset	210 >>> print trim(t, 4, ellipsis=ellipsis, leftside=True)
b515c3a63e96 encoding: add 'leftside' argument into 'trim' to switch trimming side FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 21856 diff changeset	211 +++
21856 d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	212 >>> t = '\x11\x22\x33\x44\x55\x66\x77\x88\x99\xaa' # invalid byte sequence
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	213 >>> print trim(t, 12, ellipsis=ellipsis)
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	214 \x11\x22\x33\x44\x55\x66\x77\x88\x99\xaa
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	215 >>> print trim(t, 10, ellipsis=ellipsis)
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	216 \x11\x22\x33\x44\x55\x66\x77\x88\x99\xaa
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	217 >>> print trim(t, 8, ellipsis=ellipsis)
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	218 \x11\x22\x33\x44\x55+++
21861 b515c3a63e96 encoding: add 'leftside' argument into 'trim' to switch trimming side FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 21856 diff changeset	219 >>> print trim(t, 8, ellipsis=ellipsis, leftside=True)
b515c3a63e96 encoding: add 'leftside' argument into 'trim' to switch trimming side FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 21856 diff changeset	220 +++\x66\x77\x88\x99\xaa
21856 d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	221 >>> print trim(t, 8)
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	222 \x11\x22\x33\x44\x55\x66\x77\x88
21861 b515c3a63e96 encoding: add 'leftside' argument into 'trim' to switch trimming side FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 21856 diff changeset	223 >>> print trim(t, 8, leftside=True)
b515c3a63e96 encoding: add 'leftside' argument into 'trim' to switch trimming side FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 21856 diff changeset	224 \x33\x44\x55\x66\x77\x88\x99\xaa
21856 d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	225 >>> print trim(t, 3, ellipsis=ellipsis)
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	226 +++
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	227 >>> print trim(t, 1, ellipsis=ellipsis)
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	228 +
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	229 """
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	230 try:
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	231 u = s.decode(encoding)
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	232 except UnicodeDecodeError:
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	233 if len(s) <= width: # trimming is not needed
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	234 return s
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	235 width -= len(ellipsis)
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	236 if width <= 0: # not enough room even for ellipsis
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	237 return ellipsis[:width + len(ellipsis)]
21861 b515c3a63e96 encoding: add 'leftside' argument into 'trim' to switch trimming side FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 21856 diff changeset	238 if leftside:
b515c3a63e96 encoding: add 'leftside' argument into 'trim' to switch trimming side FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 21856 diff changeset	239 return ellipsis + s[-width:]
21856 d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	240 return s[:width] + ellipsis
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	241
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	242 if ucolwidth(u) <= width: # trimming is not needed
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	243 return s
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	244
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	245 width -= len(ellipsis)
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	246 if width <= 0: # not enough room even for ellipsis
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	247 return ellipsis[:width + len(ellipsis)]
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	248
21861 b515c3a63e96 encoding: add 'leftside' argument into 'trim' to switch trimming side FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 21856 diff changeset	249 if leftside:
b515c3a63e96 encoding: add 'leftside' argument into 'trim' to switch trimming side FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 21856 diff changeset	250 uslice = lambda i: u[i:]
b515c3a63e96 encoding: add 'leftside' argument into 'trim' to switch trimming side FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 21856 diff changeset	251 concat = lambda s: ellipsis + s
b515c3a63e96 encoding: add 'leftside' argument into 'trim' to switch trimming side FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 21856 diff changeset	252 else:
b515c3a63e96 encoding: add 'leftside' argument into 'trim' to switch trimming side FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 21856 diff changeset	253 uslice = lambda i: u[:-i]
b515c3a63e96 encoding: add 'leftside' argument into 'trim' to switch trimming side FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 21856 diff changeset	254 concat = lambda s: s + ellipsis
21856 d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	255 for i in xrange(1, len(u)):
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	256 usub = uslice(i)
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	257 if ucolwidth(usub) <= width:
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	258 return concat(usub.encode(encoding))
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	259 return ellipsis # not enough room for multi-column characters
d24969ee272f encoding: add 'trim' to trim multi-byte characters at most specified columns FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 18378 diff changeset	260
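The column-aware trimming in 'trim' above can be sketched in Python 3 on already-decoded text. This is an illustration only, not the module's code: the names `ucolwidth_sketch` and `trim_text` are hypothetical, the sketch assumes only East Asian wide ('W') and fullwidth ('F') characters occupy two columns, and it skips the byte-string fallback that 'trim' uses for undecodable input.

```python
import unicodedata


def ucolwidth_sketch(u):
    """Display width of a str; wide and fullwidth characters count as 2."""
    return sum(2 if unicodedata.east_asian_width(c) in "WF" else 1 for c in u)


def trim_text(u, width, ellipsis="", leftside=False):
    """Trim str 'u' to at most 'width' columns, ellipsis included."""
    if ucolwidth_sketch(u) <= width:
        return u  # trimming is not needed
    room = width - len(ellipsis)
    if room <= 0:
        return ellipsis[:width]  # not enough room even for the ellipsis
    # Drop one more character per iteration until the remainder fits.
    for i in range(1, len(u)):
        usub = u[i:] if leftside else u[:-i]
        if ucolwidth_sketch(usub) <= room:
            return ellipsis + usub if leftside else usub + ellipsis
    return ellipsis  # not even one multi-column character fits


kana = "\u3042\u3044\u3046\u3048\u304a"  # 5 characters, 10 columns
print(trim_text("1234567890", 8, ellipsis="+++"))
print(trim_text(kana, 8, ellipsis="+++", leftside=True))
```

Shrinking one character at a time, rather than slicing at a fixed index, is what keeps a two-column character from being cut in half at the boundary.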
22778 80f2b63dd83a parsers: add a function to efficiently lowercase ASCII strings Siddharth Agarwal <sid0@fb.com> parents: 22426 diff changeset	261 def asciilower(s):
80f2b63dd83a parsers: add a function to efficiently lowercase ASCII strings Siddharth Agarwal <sid0@fb.com> parents: 22426 diff changeset	262 '''convert a string to lowercase if ASCII
80f2b63dd83a parsers: add a function to efficiently lowercase ASCII strings Siddharth Agarwal <sid0@fb.com> parents: 22426 diff changeset	263
80f2b63dd83a parsers: add a function to efficiently lowercase ASCII strings Siddharth Agarwal <sid0@fb.com> parents: 22426 diff changeset	264 Raises UnicodeDecodeError if non-ASCII characters are found.'''
80f2b63dd83a parsers: add a function to efficiently lowercase ASCII strings Siddharth Agarwal <sid0@fb.com> parents: 22426 diff changeset	265 s.decode('ascii')
80f2b63dd83a parsers: add a function to efficiently lowercase ASCII strings Siddharth Agarwal <sid0@fb.com> parents: 22426 diff changeset	266 return s.lower()
80f2b63dd83a parsers: add a function to efficiently lowercase ASCII strings Siddharth Agarwal <sid0@fb.com> parents: 22426 diff changeset	267
80f2b63dd83a parsers: add a function to efficiently lowercase ASCII strings Siddharth Agarwal <sid0@fb.com> parents: 22426 diff changeset	268 asciilower = getattr(parsers, 'asciilower', asciilower)
80f2b63dd83a parsers: add a function to efficiently lowercase ASCII strings Siddharth Agarwal <sid0@fb.com> parents: 22426 diff changeset	269
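The pure-Python fallback and the 'getattr' override above follow a common pattern: define a slow but correct version, then replace it with a native one when the extension module provides it. A minimal Python 3 sketch of that pattern follows. The class `_noext` is a hypothetical stand-in for an extension module built without 'asciilower', and `asciilower_sketch` and `asciilower_impl` are illustrative names, not part of Mercurial.

```python
def asciilower_sketch(s):
    """Lowercase bytes 's' if it is pure ASCII.

    Raises UnicodeDecodeError if a non-ASCII byte is found.
    """
    s.decode("ascii")  # validation only; the decoded result is discarded
    return s.lower()


class _noext(object):
    """Hypothetical stand-in for an extension module lacking asciilower."""


# Prefer the native implementation when present; otherwise keep the fallback.
asciilower_impl = getattr(_noext, "asciilower", asciilower_sketch)

print(asciilower_impl(b"README.TXT"))
```

Because the lookup happens once at import time, callers pay no per-call cost for choosing between the two implementations.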
14069 e38846a79a23 encoding: add an encoding-aware lower function Matt Mackall <mpm@selenic.com> parents: 13940 diff changeset	270 def lower(s):
e38846a79a23 encoding: add an encoding-aware lower function Matt Mackall <mpm@selenic.com> parents: 13940 diff changeset	271 "best-effort encoding-aware case-folding of local string s"
e38846a79a23 encoding: add an encoding-aware lower function Matt Mackall <mpm@selenic.com> parents: 13940 diff changeset	272 try:
17235 3745ae495ce5 encoding: use s.decode to trigger UnicodeDecodeError Martin Geisler <mg@aragost.com> parents: 16493 diff changeset	273 s.decode('ascii') # throw exception for non-ASCII character
3745ae495ce5 encoding: use s.decode to trigger UnicodeDecodeError Martin Geisler <mg@aragost.com> parents: 16493 diff changeset	274 return s.lower()
3745ae495ce5 encoding: use s.decode to trigger UnicodeDecodeError Martin Geisler <mg@aragost.com> parents: 16493 diff changeset	275 except UnicodeDecodeError:
16387 c481761033bd encoding: add fast-path for ASCII lowercase Matt Mackall <mpm@selenic.com> parents: 16274 diff changeset	276 pass
c481761033bd encoding: add fast-path for ASCII lowercase Matt Mackall <mpm@selenic.com> parents: 16274 diff changeset	277 try:
14069 e38846a79a23 encoding: add an encoding-aware lower function Matt Mackall <mpm@selenic.com> parents: 13940 diff changeset	278 if isinstance(s, localstr):
e38846a79a23 encoding: add an encoding-aware lower function Matt Mackall <mpm@selenic.com> parents: 13940 diff changeset	279 u = s._utf8.decode("utf-8")
e38846a79a23 encoding: add an encoding-aware lower function Matt Mackall <mpm@selenic.com> parents: 13940 diff changeset	280 else:
e38846a79a23 encoding: add an encoding-aware lower function Matt Mackall <mpm@selenic.com> parents: 13940 diff changeset	281 u = s.decode(encoding, encodingmode)
e38846a79a23 encoding: add an encoding-aware lower function Matt Mackall <mpm@selenic.com> parents: 13940 diff changeset	282
e38846a79a23 encoding: add an encoding-aware lower function Matt Mackall <mpm@selenic.com> parents: 13940 diff changeset	283 lu = u.lower()
e38846a79a23 encoding: add an encoding-aware lower function Matt Mackall <mpm@selenic.com> parents: 13940 diff changeset	284 if u == lu:
e38846a79a23 encoding: add an encoding-aware lower function Matt Mackall <mpm@selenic.com> parents: 13940 diff changeset	285 return s # preserve localstring
e38846a79a23 encoding: add an encoding-aware lower function Matt Mackall <mpm@selenic.com> parents: 13940 diff changeset	286 return lu.encode(encoding)
e38846a79a23 encoding: add an encoding-aware lower function Matt Mackall <mpm@selenic.com> parents: 13940 diff changeset	287 except UnicodeError:
e38846a79a23 encoding: add an encoding-aware lower function Matt Mackall <mpm@selenic.com> parents: 13940 diff changeset	288 return s.lower() # we don't know how to fold this except in ASCII
15672 2ebe3d0ce91d i18n: use encoding.lower/upper for encoding aware case folding FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 15143 diff changeset	289 except LookupError, k:
2ebe3d0ce91d i18n: use encoding.lower/upper for encoding aware case folding FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 15143 diff changeset	290 raise error.Abort(k, hint="please check your locale settings")
2ebe3d0ce91d i18n: use encoding.lower/upper for encoding aware case folding FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 15143 diff changeset	291
2ebe3d0ce91d i18n: use encoding.lower/upper for encoding aware case folding FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 15143 diff changeset	292 def upper(s):
2ebe3d0ce91d i18n: use encoding.lower/upper for encoding aware case folding FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 15143 diff changeset	293 "best-effort encoding-aware case-folding of local string s"
2ebe3d0ce91d i18n: use encoding.lower/upper for encoding aware case folding FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 15143 diff changeset	294 try:
17236 9fb8312dbdbd encoding: add fast-path for ASCII uppercase. Martin Geisler <mg@aragost.com> parents: 17235 diff changeset	295 s.decode('ascii') # throw exception for non-ASCII character
9fb8312dbdbd encoding: add fast-path for ASCII uppercase. Martin Geisler <mg@aragost.com> parents: 17235 diff changeset	296 return s.upper()
9fb8312dbdbd encoding: add fast-path for ASCII uppercase. Martin Geisler <mg@aragost.com> parents: 17235 diff changeset	297 except UnicodeDecodeError:
9fb8312dbdbd encoding: add fast-path for ASCII uppercase. Martin Geisler <mg@aragost.com> parents: 17235 diff changeset	298 pass
9fb8312dbdbd encoding: add fast-path for ASCII uppercase. Martin Geisler <mg@aragost.com> parents: 17235 diff changeset	299 try:
15672 2ebe3d0ce91d i18n: use encoding.lower/upper for encoding aware case folding FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 15143 diff changeset	300 if isinstance(s, localstr):
2ebe3d0ce91d i18n: use encoding.lower/upper for encoding aware case folding FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 15143 diff changeset	301 u = s._utf8.decode("utf-8")
2ebe3d0ce91d i18n: use encoding.lower/upper for encoding aware case folding FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 15143 diff changeset	302 else:
2ebe3d0ce91d i18n: use encoding.lower/upper for encoding aware case folding FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 15143 diff changeset	303 u = s.decode(encoding, encodingmode)
2ebe3d0ce91d i18n: use encoding.lower/upper for encoding aware case folding FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 15143 diff changeset	304
2ebe3d0ce91d i18n: use encoding.lower/upper for encoding aware case folding FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 15143 diff changeset	305 uu = u.upper()
2ebe3d0ce91d i18n: use encoding.lower/upper for encoding aware case folding FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 15143 diff changeset	306 if u == uu:
2ebe3d0ce91d i18n: use encoding.lower/upper for encoding aware case folding FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 15143 diff changeset	307 return s # preserve localstring
2ebe3d0ce91d i18n: use encoding.lower/upper for encoding aware case folding FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 15143 diff changeset	308 return uu.encode(encoding)
2ebe3d0ce91d i18n: use encoding.lower/upper for encoding aware case folding FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 15143 diff changeset	309 except UnicodeError:
2ebe3d0ce91d i18n: use encoding.lower/upper for encoding aware case folding FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 15143 diff changeset	310 return s.upper() # we don't know how to fold this except in ASCII
2ebe3d0ce91d i18n: use encoding.lower/upper for encoding aware case folding FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 15143 diff changeset	311 except LookupError, k:
2ebe3d0ce91d i18n: use encoding.lower/upper for encoding aware case folding FUJIWARA Katsunori <foozy@lares.dti.ne.jp> parents: 15143 diff changeset	312 raise error.Abort(k, hint="please check your locale settings")
16133 84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	313
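The fallback chain shared by 'lower' and 'upper' above (ASCII fast path, then decode in the local encoding, then ASCII-only folding) can be sketched in Python 3 as follows. This is a simplified illustration under stated assumptions: `lower_local` is a hypothetical name, the encoding is passed as a parameter instead of being read from module state, and the 'localstr' and 'LookupError' handling is left out.

```python
def lower_local(s, encoding="utf-8"):
    """Best-effort lowercasing of bytes 's' stored in 'encoding'."""
    try:
        s.decode("ascii")  # raises for any non-ASCII byte
        return s.lower()   # fast path: pure ASCII
    except UnicodeDecodeError:
        pass
    try:
        u = s.decode(encoding)
        lu = u.lower()
        if u == lu:
            return s  # nothing changed; hand back the original object
        return lu.encode(encoding)
    except UnicodeError:
        # Undecodable input: bytes.lower() folds only the ASCII letters.
        return s.lower()


print(lower_local(b"ABC"))
print(lower_local("CAF\u00c9".encode("utf-8")))
```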
22426 f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	314 _jsonmap = {}
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	315
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	316 def jsonescape(s):
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	317 '''returns a string suitable for JSON
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	318
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	319 JSON is problematic for us because it doesn't support non-Unicode
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	320 bytes. To deal with this, we take the following approach:
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	321
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	322 - localstr objects are converted back to UTF-8
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	323 - valid UTF-8/ASCII strings are passed as-is
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	324 - other strings are converted to UTF-8b surrogate encoding
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	325 - apply JSON-specified string escaping
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	326
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	327 (escapes are doubled in these tests)
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	328
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	329 >>> jsonescape('this is a test')
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	330 'this is a test'
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	331 >>> jsonescape('escape characters: \\0 \\x0b \\t \\n \\r \\" \\\\')
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	332 'escape characters: \\\\u0000 \\\\u000b \\\\t \\\\n \\\\r \\\\" \\\\\\\\'
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	333 >>> jsonescape('a weird byte: \\xdd')
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	334 'a weird byte: \\xed\\xb3\\x9d'
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	335 >>> jsonescape('utf-8: caf\\xc3\\xa9')
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	336 'utf-8: caf\\xc3\\xa9'
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	337 >>> jsonescape('')
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	338 ''
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	339 '''
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	340
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	341 if not _jsonmap:
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	342 for x in xrange(32):
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	343 _jsonmap[chr(x)] = "\\u%04x" % x
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	344 for x in xrange(32, 256):
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	345 c = chr(x)
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	346 _jsonmap[c] = c
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	347 _jsonmap['\t'] = '\\t'
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	348 _jsonmap['\n'] = '\\n'
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	349 _jsonmap['\"'] = '\\"'
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	350 _jsonmap['\\'] = '\\\\'
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	351 _jsonmap['\b'] = '\\b'
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	352 _jsonmap['\f'] = '\\f'
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	353 _jsonmap['\r'] = '\\r'
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	354
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	355 return ''.join(_jsonmap[c] for c in toutf8b(s))
f6b533e64ed6 encoding: add json escaping filter Matt Mackall <mpm@selenic.com> parents: 22425 diff changeset	356
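The table-driven escaping in 'jsonescape' above can be sketched in Python 3 on decoded text. This is an illustration only: `_jsonmap_sketch` and `jsonescape_text` are hypothetical names, the sketch works on 'str' instead of byte strings, and it omits the 'toutf8b' conversion step that the real function applies first.

```python
import json

# Control characters below 0x20 get a \uXXXX escape by default.
_jsonmap_sketch = {chr(x): "\\u%04x" % x for x in range(32)}
# Characters with a short JSON escape override the defaults above.
_jsonmap_sketch.update({
    "\t": "\\t", "\n": "\\n", '"': '\\"', "\\": "\\\\",
    "\b": "\\b", "\f": "\\f", "\r": "\\r",
})


def jsonescape_text(u):
    """Escape str 'u' for use inside a JSON string literal."""
    # Characters missing from the map pass through unchanged.
    return "".join(_jsonmap_sketch.get(c, c) for c in u)


sample = 'tab\t quote" backslash\\ nul\x00 caf\u00e9'
escaped = jsonescape_text(sample)
# Wrapping the result in quotes should give a literal that parses back.
print(json.loads('"' + escaped + '"') == sample)
```

Building the map once and joining per-character lookups avoids a chain of 'replace' calls, whose order would otherwise matter for the backslash.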
16133 84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	357 def toutf8b(s):
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	358 '''convert a local, possibly-binary string into UTF-8b
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	359
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	360 This is intended as a generic method to preserve data when working
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	361 with schemes like JSON and XML that have no provision for
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	362 arbitrary byte strings. As Mercurial often doesn't know
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	363 what encoding data is in, we use so-called UTF-8b.
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	364
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	365 If a string is already valid UTF-8 (or ASCII), it passes unmodified.
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	366 Otherwise, unsupported bytes are mapped to the UTF-16 surrogate range,
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	367 U+DC00-U+DCFF.
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	368
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	369 Principles of operation:
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	370
17424 e7cfe3587ea4 fix trivial spelling errors Mads Kiilerich <mads@kiilerich.com> parents: 17236 diff changeset	371 - ASCII and UTF-8 data successfully round-trips and is understood
16133 84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	372 by Unicode-oriented clients
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	373 - filenames and file contents in arbitrary other encodings can
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	374 be round-tripped or recovered by clueful clients
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	375 - local strings that have a cached known UTF-8 encoding (aka
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	376 localstr) get sent as UTF-8 so Unicode-oriented clients get the
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	377 Unicode data they want
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	378 - because we must preserve UTF-8 bytestrings in places such as
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	379 filenames, metadata can't be round-tripped without help
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	380
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	381 (Note: "UTF-8b" often refers to decoding a mix of valid UTF-8 and
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	382 arbitrary bytes into an internal Unicode format that can be
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	383 re-encoded back into the original. Here we are exposing the
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	384 internal surrogate encoding as a UTF-8 string.)
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	385 '''
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	386
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	387 if isinstance(s, localstr):
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	388 return s._utf8
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	389
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	390 try:
22425 6fd944c204a9 encoding: handle empty string in toutf8 Matt Mackall <mpm@selenic.com> parents: 21861 diff changeset	391 s.decode('utf-8')
6fd944c204a9 encoding: handle empty string in toutf8 Matt Mackall <mpm@selenic.com> parents: 21861 diff changeset	392 return s
16133 84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	393 except UnicodeDecodeError:
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	394 # surrogate-encode any characters that don't round-trip
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	395 s2 = s.decode('utf-8', 'ignore').encode('utf-8')
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	396 r = ""
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	397 pos = 0
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	398 for c in s:
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	399 if s2[pos:pos + 1] == c:
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	400 r += c
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	401 pos += 1
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	402 else:
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	403 r += unichr(0xdc00 + ord(c)).encode('utf-8')
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	404 return r
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	405
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	406 def fromutf8b(s):
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	407 '''Given a UTF-8b string, return a local, possibly-binary string.
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	408
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	409 Returns the original binary string. This
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	410 is a round-trip process for strings like filenames, but metadata
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	411 that was passed through tolocal will remain in UTF-8.
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	412
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	413 >>> m = "\\xc3\\xa9\\x99abcd"
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	414 >>> n = toutf8b(m)
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	415 >>> n
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	416 '\\xc3\\xa9\\xed\\xb2\\x99abcd'
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	417 >>> fromutf8b(n) == m
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	418 True
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	419 '''
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	420
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	421 # fast path - look for uDxxx prefixes in s
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	422 if "\xed" not in s:
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	423 return s
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	424
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	425 u = s.decode("utf-8")
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	426 r = ""
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	427 for c in u:
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	428 if ord(c) & 0xff00 == 0xdc00:
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	429 r += chr(ord(c) & 0xff)
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	430 else:
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	431 r += c.encode("utf-8")
84c58da3a1f8 encoding: introduce utf8-b helpers Matt Mackall <mpm@selenic.com> parents: 15769 diff changeset	432 return r
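
The listing above is Python 2 code (`unichr`, byte-oriented `str`). As a rough illustration of the same surrogate mapping, the sketch below uses Python 3's standard `surrogateescape` and `surrogatepass` codec error handlers. The names `to_utf8b` and `from_utf8b` are hypothetical and are not part of Mercurial; this is not the module's implementation, and its handling of malformed sequences may differ from the `'ignore'`-based loop in `toutf8b` for some inputs.

```python
# Illustrative Python 3 sketch only; to_utf8b/from_utf8b are hypothetical
# names, not Mercurial APIs.

def to_utf8b(raw: bytes) -> bytes:
    """Map each undecodable byte 0xNN to the lone surrogate U+DCNN,
    then emit the result as a (non-strict) UTF-8 byte string."""
    # 'surrogateescape' turns undecodable bytes 0x80-0xFF into U+DC80-U+DCFF.
    text = raw.decode("utf-8", "surrogateescape")
    # 'surrogatepass' lets the lone surrogates be written out as 3-byte sequences.
    return text.encode("utf-8", "surrogatepass")


def from_utf8b(data: bytes) -> bytes:
    """Reverse to_utf8b: turn encoded U+DC80-U+DCFF back into raw bytes."""
    text = data.decode("utf-8", "surrogatepass")
    # Raises UnicodeEncodeError for surrogates outside U+DC80-U+DCFF.
    return text.encode("utf-8", "surrogateescape")


if __name__ == "__main__":
    m = b"\xc3\xa9\x99abcd"   # same sample as the fromutf8b doctest
    n = to_utf8b(m)
    print(n)
    print(from_utf8b(n) == m)
```

Valid UTF-8 and plain ASCII pass through `to_utf8b` unchanged, which matches the first principle of operation listed in the `toutf8b` docstring.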
