encoding: re-escape U+DCxx characters in toutf8b input (issue4927)
This is the final missing piece in fully round-tripping arbitrary byte
strings through UTF-8b. The re-escaping means that UTF-8 <-> UTF-8b
isn't strictly bijective, but we don't expect to ever see U+DCxx
codepoints in "real" UTF-8 data, so it should remain bijective in practice.
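
The round trip described above can be sketched with Python 3's standard
codecs. This is only an illustration, not the code in this change: it
assumes Python 3, borrows the name toutf8b from the subject line, and
uses a hypothetical inverse called fromutf8b. Python's strict UTF-8
decoder rejects encoded surrogates, so the bytes of a U+DCxx sequence
already present in the input are themselves escaped one by one, which
is the re-escaping behaviour in question.

```python
def toutf8b(data: bytes) -> bytes:
    """Sketch: map arbitrary bytes to UTF-8b (assumes Python 3 codecs)."""
    # Invalid bytes 0x80..0xFF become lone surrogates U+DC80..U+DCFF.
    # Bytes that spell an already-encoded surrogate are invalid to the
    # strict decoder, so each of them is escaped as well.
    text = data.decode("utf-8", "surrogateescape")
    # "surrogatepass" lets the lone surrogates be written out as bytes.
    return text.encode("utf-8", "surrogatepass")


def fromutf8b(data: bytes) -> bytes:
    """Sketch of the inverse (hypothetical name): UTF-8b back to raw bytes."""
    text = data.decode("utf-8", "surrogatepass")
    # Each U+DC80..U+DCFF escape turns back into its original byte.
    return text.encode("utf-8", "surrogateescape")


raw = b"caf\xe9 \xed\xb3\xbf"  # a Latin-1 byte plus an encoded U+DCFF
assert fromutf8b(toutf8b(raw)) == raw
```

Valid UTF-8 passes through toutf8b unchanged in this sketch; only
invalid bytes and pre-existing U+DCxx sequences grow in size.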
encoding: use getutf8char in toutf8b
Working one character at a time correctly avoids the ambiguity of
U+FFFD already being present in the input, and similar confusion.
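
A minimal sketch of the character-at-a-time approach follows. The
helper reuses the name getutf8char from the subject line, but its
signature and body here are assumptions for illustration (Python 3),
as is the hypothetical escape_invalid wrapper. The point it shows is
that a genuine U+FFFD in the input is copied through as a valid
character, whereas a whole-string decode with errors="replace" would
make it indistinguishable from a replaced invalid byte.

```python
def getutf8char(s: bytes, pos: int) -> bytes:
    """Return the UTF-8 sequence starting at pos, or raise ValueError.

    Assumed signature, for illustration only.
    """
    lead = s[pos]
    if lead < 0x80:
        n = 1
    elif 0xC0 <= lead < 0xE0:
        n = 2
    elif 0xE0 <= lead < 0xF0:
        n = 3
    elif 0xF0 <= lead < 0xF8:
        n = 4
    else:
        raise ValueError("invalid UTF-8 lead byte")
    char = s[pos:pos + n]
    # The strict decoder rejects truncated, overlong and surrogate forms;
    # UnicodeDecodeError is a subclass of ValueError.
    char.decode("utf-8")
    return char


def escape_invalid(s: bytes) -> bytes:
    """Hypothetical helper: escape each invalid byte as U+DCxx."""
    out = bytearray()
    pos = 0
    while pos < len(s):
        try:
            char = getutf8char(s, pos)
            out += char
            pos += len(char)
        except ValueError:
            out += chr(0xDC00 + s[pos]).encode("utf-8", "surrogatepass")
            pos += 1
    return bytes(out)


fffd = "\ufffd".encode("utf-8")
assert escape_invalid(fffd) == fffd  # a real U+FFFD is left alone
assert escape_invalid(b"a\xffb") == b"a\xed\xb3\xbfb"  # 0xFF -> U+DCFF
# With errors="replace" the two inputs become indistinguishable:
assert fffd.decode("utf-8", "replace") == b"\xff".decode("utf-8", "replace")
```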
posix: use getutf8char to handle OS X filename percent-escaping
This replaces an open-coded UTF-8 parser that ignored subtle issues
like overlong encodings.
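
As a rough illustration of why a strict per-character check matters
for percent-escaping, the sketch below escapes every byte that is not
part of a well-formed UTF-8 sequence as %XX. The function names and
the exact %XX format are assumptions made for this example (Python 3),
not taken from the change itself. An overlong encoding such as
b"\xc0\xaf" (an overlong "/") is rejected by the strict decoder and so
gets escaped rather than passed through.

```python
def valid_char_len(s: bytes, pos: int) -> int:
    """Length of the well-formed UTF-8 sequence at pos, or 0 if invalid.

    Hypothetical helper, for illustration only.
    """
    lead = s[pos]
    if lead < 0x80:
        return 1
    if 0xC0 <= lead < 0xE0:
        n = 2
    elif 0xE0 <= lead < 0xF0:
        n = 3
    elif 0xF0 <= lead < 0xF8:
        n = 4
    else:
        return 0  # stray continuation byte or invalid lead byte
    try:
        # Strict decoding rejects overlong, truncated and surrogate forms.
        s[pos:pos + n].decode("utf-8")
    except UnicodeDecodeError:
        return 0
    return n


def percent_escape_invalid(path: bytes) -> bytes:
    """Hypothetical helper: replace each invalid byte with %XX."""
    out = bytearray()
    pos = 0
    while pos < len(path):
        n = valid_char_len(path, pos)
        if n:
            out += path[pos:pos + n]
            pos += n
        else:
            out += b"%%%02X" % path[pos]
            pos += 1
    return bytes(out)


assert percent_escape_invalid(b"dir\xc0\xaffile") == b"dir%C0%AFfile"
```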