changeset 27699:c8d3392f76e1

encoding: handle UTF-16 internal limit with fromutf8b (issue5031)

Default builds of Python have a Unicode type that isn't actually full
Unicode but UTF-16, which encodes non-BMP codepoints to a pair of BMP
codepoints with surrogate escaping. Since our UTF-8b hack escaping uses a
plane that overlaps with the UTF-16 escaping system, this gets extra
complicated. In addition, unichr() for codepoints greater than U+FFFF may
not work either.

This changes the code to reuse getutf8char to walk the byte string, so we
only rely on Python for unpacking our U+DCxx characters.
author Matt Mackall <mpm@selenic.com>
date Thu, 07 Jan 2016 14:57:57 -0600
parents dad6404ccddb
children 374fad80ce69
files mercurial/encoding.py
diffstat 1 files changed, 16 insertions(+), 6 deletions(-)
--- a/mercurial/encoding.py	Wed Nov 11 21:18:02 2015 -0500
+++ b/mercurial/encoding.py	Thu Jan 07 14:57:57 2016 -0600
@@ -516,17 +516,27 @@
     True
     >>> roundtrip("\\xef\\xef\\xbf\\xbd")
     True
+    >>> roundtrip("\\xf1\\x80\\x80\\x80\\x80")
+    True
     '''
 
     # fast path - look for uDxxx prefixes in s
     if "\xed" not in s:
         return s
 
-    u = s.decode("utf-8")
+    # We could do this with the unicode type but some Python builds
+    # use UTF-16 internally (issue5031) which causes non-BMP code
+    # points to be escaped. Instead, we use our handy getutf8char
+    # helper again to walk the string without "decoding" it.
+
     r = ""
-    for c in u:
-        if ord(c) & 0xffff00 == 0xdc00:
-            r += chr(ord(c) & 0xff)
-        else:
-            r += c.encode("utf-8")
+    pos = 0
+    l = len(s)
+    while pos < l:
+        c = getutf8char(s, pos)
+        pos += len(c)
+        # unescape U+DCxx characters
+        if "\xed\xb0\x80" <= c <= "\xed\xb3\xbf":
+            c = chr(ord(c.decode("utf-8")) & 0xff)
+        r += c
     return r
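
The byte-walking approach in the new loop can be sketched on its own. The following is a minimal Python 3 illustration, not the Mercurial implementation: the names utf8_char_at and unescape_utf8b are hypothetical stand-ins for getutf8char and fromutf8b, and the "surrogatepass" error handler is an assumption needed because Python 3 otherwise refuses to decode lone surrogates such as U+DCxx. The point it shows is that only the three-byte U+DCxx sequences are ever decoded, so non-BMP characters pass through as raw bytes.

```python
# Expected sequence length, indexed by the high nibble of the lead byte.
# 0 means "treat as a single byte" (ASCII or a stray continuation byte).
_UTF8_LEN = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 3, 4]


def utf8_char_at(data: bytes, pos: int) -> bytes:
    """Return the bytes of the UTF-8 sequence starting at pos (hypothetical helper)."""
    n = _UTF8_LEN[data[pos] >> 4]
    if n == 0:
        return data[pos:pos + 1]
    chunk = data[pos:pos + n]
    # Validate by attempting a decode; raises UnicodeDecodeError if malformed.
    # "surrogatepass" lets the escaped U+DCxx sequences through.
    chunk.decode("utf-8", "surrogatepass")
    return chunk


def unescape_utf8b(data: bytes) -> bytes:
    """Turn escaped U+DC00..U+DCFF sequences back into the raw bytes they stand for."""
    # Fast path: every U+Dxxx sequence starts with the byte 0xED.
    if b"\xed" not in data:
        return data
    out = bytearray()
    pos = 0
    while pos < len(data):
        c = utf8_char_at(data, pos)
        pos += len(c)
        # b"\xed\xb0\x80" is U+DC00 and b"\xed\xb3\xbf" is U+DCFF.
        if b"\xed\xb0\x80" <= c <= b"\xed\xb3\xbf":
            out.append(ord(c.decode("utf-8", "surrogatepass")) & 0xFF)
        else:
            out += c
    return bytes(out)
```

For example, the input b"\xf1\x80\x80\x80\xed\xb2\x80" is a four-byte non-BMP character followed by an escaped 0x80 byte (U+DC80). The walk copies the first four bytes untouched and unescapes only the last three, which corresponds to the new doctest case added in this change.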