comparison mercurial/pycompat.py @ 44998:f2de8f31cb59
pycompat: use os.fsencode() to re-encode sys.argv
Historically, the previous code made sense, as Py_EncodeLocale() and
os.fsencode() could use different encodings. However, this is no longer the
case in Python 3.2, which uses the locale encoding as the filesystem
encoding (this is not true for later Python versions, but see below). See
https://vstinner.github.io/painful-history-python-filesystem-encoding.html for
a source and more background information.
Using os.fsencode() is safer, as the documentation for sys.argv says that it can
be used to get the original bytes. When making further changes, the Python
developers will take care that this continues to work.
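As a rough illustration of that round-trip guarantee, the sketch below decodes a byte string with os.fsdecode() (approximating what the interpreter does to argv on POSIX) and encodes it back with os.fsencode(). The helper name `roundtrip` and the sample bytes are made up for this example, and it assumes a POSIX system where the filesystem error handler is surrogateescape.

```python
import os


def roundtrip(raw: bytes) -> bytes:
    """Decode bytes with the filesystem encoding, then encode them back."""
    # os.fsdecode() uses the filesystem encoding and its error handler
    # (surrogateescape on POSIX), so undecodable bytes become lone surrogates.
    text = os.fsdecode(raw)
    # os.fsencode() is the inverse and turns those surrogates back into bytes.
    return os.fsencode(text)


# Valid UTF-8 for "café" followed by a byte that is never valid in UTF-8.
raw_argument = b"caf\xc3\xa9-\xff"
restored = roundtrip(raw_argument)
```

Under those assumptions, `restored` should equal `raw_argument` even though the last byte cannot be decoded as text.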
One concrete case where os.fsencode() is more correct is when Python's UTF-8
mode is enabled. Py_DecodeLocale() will use UTF-8 in this case. Our previous
code would have re-encoded the arguments using the locale encoding (which might
be different), whereas os.fsencode() will encode them with UTF-8.
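The mismatch can be sketched without actually enabling UTF-8 mode by choosing the encodings by hand. The helper `encode_argument` is hypothetical, and ISO-8859-1 merely stands in for a locale encoding that differs from UTF-8.

```python
def encode_argument(arg: str, encoding: str) -> bytes:
    """Re-encode a decoded command-line argument with surrogateescape."""
    return arg.encode(encoding, "surrogateescape")


# Bytes the shell passed for the argument "é", encoded as UTF-8.
original = b"\xc3\xa9"

# In UTF-8 mode the interpreter decodes argv as UTF-8, whatever the locale is.
decoded = original.decode("utf-8", "surrogateescape")

# Old approach: encode with the locale encoding (assumed to be ISO-8859-1 here).
via_locale = encode_argument(decoded, "iso-8859-1")

# New approach: os.fsencode() would use UTF-8 in UTF-8 mode.
via_utf8 = encode_argument(decoded, "utf-8")
```

Here `via_utf8` matches `original`, while `via_locale` is the single byte `b"\xe9"`, so the locale-based re-encoding would not reproduce the bytes that were actually passed on the command line.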
Since we don’t claim to support the UTF-8 mode, this is not really a bug and the
patch can go to the default branch. It might be a good idea to not commit this
to the stable branch, as it could in theory introduce regressions.
author | Manuel Jacob <me@manueljacob.de> |
---|---|
date | Wed, 24 Jun 2020 14:44:21 +0200 |
parents | afcad425a0b6 |
children | a25343d16ebe |
44997:93aa152d4295 | 44998:f2de8f31cb59 |
---|---|
96 if ispy3: | 96 if ispy3: |
97 import builtins | 97 import builtins |
98 import codecs | 98 import codecs |
99 import functools | 99 import functools |
100 import io | 100 import io |
101 import locale | |
102 import struct | 101 import struct |
103 | 102 |
104 if os.name == r'nt' and sys.version_info >= (3, 6): | 103 if os.name == r'nt' and sys.version_info >= (3, 6): |
105 # MBCS (or ANSI) filesystem encoding must be used as before. | 104 # MBCS (or ANSI) filesystem encoding must be used as before. |
106 # Otherwise non-ASCII filenames in existing repositories would be | 105 # Otherwise non-ASCII filenames in existing repositories would be |
154 stdout = sys.stdout.buffer | 153 stdout = sys.stdout.buffer |
155 stderr = sys.stderr.buffer | 154 stderr = sys.stderr.buffer |
156 | 155 |
157 if getattr(sys, 'argv', None) is not None: | 156 if getattr(sys, 'argv', None) is not None: |
158 # On POSIX, the char** argv array is converted to Python str using | 157 # On POSIX, the char** argv array is converted to Python str using |
159 # Py_DecodeLocale(). The inverse of this is Py_EncodeLocale(), which isn't | 158 # Py_DecodeLocale(). The inverse of this is Py_EncodeLocale(), which |
160 # directly callable from Python code. So, we need to emulate it. | 159 # isn't directly callable from Python code. In practice, os.fsencode() |
161 # Py_DecodeLocale() calls mbstowcs() and falls back to mbrtowc() with | 160 # can be used instead (this is recommended by Python's documentation |
162 # surrogateescape error handling on failure. These functions take the | 161 # for sys.argv). |
163 # current system locale into account. So, the inverse operation is to | |
164 # .encode() using the system locale's encoding and using the | |
165 # surrogateescape error handler. The only tricky part here is getting | |
166 # the system encoding correct, since `locale.getlocale()` can return | |
167 # None. We fall back to the filesystem encoding if lookups via `locale` | |
168 # fail, as this seems like a reasonable thing to do. | |
169 # | 162 # |
170 # On Windows, the wchar_t **argv is passed into the interpreter as-is. | 163 # On Windows, the wchar_t **argv is passed into the interpreter as-is. |
171 # Like POSIX, we need to emulate what Py_EncodeLocale() would do. But | 164 # Like POSIX, we need to emulate what Py_EncodeLocale() would do. But |
172 # there's an additional wrinkle. What we really want to access is the | 165 # there's an additional wrinkle. What we really want to access is the |
173 # ANSI codepage representation of the arguments, as this is what | 166 # ANSI codepage representation of the arguments, as this is what |
176 # encoding, which will pass CP_ACP to the underlying Windows API to | 169 # encoding, which will pass CP_ACP to the underlying Windows API to |
177 # produce bytes. | 170 # produce bytes. |
178 if os.name == r'nt': | 171 if os.name == r'nt': |
179 sysargv = [a.encode("mbcs", "ignore") for a in sys.argv] | 172 sysargv = [a.encode("mbcs", "ignore") for a in sys.argv] |
180 else: | 173 else: |
181 | 174 sysargv = [fsencode(a) for a in sys.argv] |
182 def getdefaultlocale_if_known(): | |
183 try: | |
184 return locale.getdefaultlocale() | |
185 except ValueError: | |
186 return None, None | |
187 | |
188 encoding = ( | |
189 locale.getlocale()[1] | |
190 or getdefaultlocale_if_known()[1] | |
191 or sys.getfilesystemencoding() | |
192 ) | |
193 sysargv = [a.encode(encoding, "surrogateescape") for a in sys.argv] | |
194 | 175 |
195 bytechr = struct.Struct('>B').pack | 176 bytechr = struct.Struct('>B').pack |
196 byterepr = b'%r'.__mod__ | 177 byterepr = b'%r'.__mod__ |
197 | 178 |
198 class bytestr(bytes): | 179 class bytestr(bytes): |