comparison mercurial/pycompat.py @ 44998:f2de8f31cb59

pycompat: use os.fsencode() to re-encode sys.argv

Historically, the previous code made sense, as Py_EncodeLocale() and os.fsencode() could possibly use different encodings. However, this is no longer the case in Python 3.2, which uses the locale encoding as the filesystem encoding (this is not true for later Python versions, but see below). See https://vstinner.github.io/painful-history-python-filesystem-encoding.html for a source and more background information.

Using os.fsencode() is safer, as the documentation for sys.argv says that it can be used to get the original bytes. When making further changes, the Python developers will take care that this continues to work.

One concrete case where os.fsencode() is more correct is when Python's UTF-8 mode is enabled. Py_DecodeLocale() will use UTF-8 in this case. Our previous code would have encoded the arguments using the locale encoding (which might be different), whereas os.fsencode() will encode them with UTF-8.

Since we don’t claim to support the UTF-8 mode, this is not really a bug and the patch can go to the default branch. It might be a good idea not to commit this to the stable branch, as it could in theory introduce regressions.
author Manuel Jacob <me@manueljacob.de>
date Wed, 24 Jun 2020 14:44:21 +0200
parents afcad425a0b6
children a25343d16ebe
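
The round trip that the commit description relies on can be sketched as follows. This is a minimal illustration, not code from the changeset: it assumes a POSIX system (where the filesystem error handler is surrogateescape), and the helper name roundtrip and the sample argument are hypothetical.

```python
import os
import sys


def roundtrip(raw):
    """Decode bytes with the filesystem encoding, then re-encode them.

    This approximates what happens to a command-line argument on POSIX:
    the interpreter decodes it to str, and os.fsencode() recovers the bytes.
    """
    # os.fsdecode() uses the filesystem encoding and error handler
    # (surrogateescape on POSIX), so undecodable bytes become lone surrogates.
    text = os.fsdecode(raw)
    # os.fsencode() is the inverse: escaped surrogates turn back into raw bytes.
    return text, os.fsencode(text)


if __name__ == "__main__":
    # Hypothetical argument containing a byte (0xff) that is invalid in UTF-8.
    raw_arg = b"file-\xff.txt"
    text, encoded = roundtrip(raw_arg)
    print(sys.getfilesystemencoding(), ascii(text), encoded == raw_arg)
```

On Windows the filesystem error handler differs, which is presumably one reason the change keeps the separate "mbcs" branch for that platform.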
comparison
--- mercurial/pycompat.py @ 44997:93aa152d4295
+++ mercurial/pycompat.py @ 44998:f2de8f31cb59
@@ -96,11 +96,10 @@
 if ispy3:
     import builtins
     import codecs
     import functools
     import io
-    import locale
     import struct
 
     if os.name == r'nt' and sys.version_info >= (3, 6):
         # MBCS (or ANSI) filesystem encoding must be used as before.
         # Otherwise non-ASCII filenames in existing repositories would be
@@ -154,20 +153,14 @@
     stdout = sys.stdout.buffer
     stderr = sys.stderr.buffer
 
     if getattr(sys, 'argv', None) is not None:
         # On POSIX, the char** argv array is converted to Python str using
-        # Py_DecodeLocale(). The inverse of this is Py_EncodeLocale(), which isn't
-        # directly callable from Python code. So, we need to emulate it.
-        # Py_DecodeLocale() calls mbstowcs() and falls back to mbrtowc() with
-        # surrogateescape error handling on failure. These functions take the
-        # current system locale into account. So, the inverse operation is to
-        # .encode() using the system locale's encoding and using the
-        # surrogateescape error handler. The only tricky part here is getting
-        # the system encoding correct, since `locale.getlocale()` can return
-        # None. We fall back to the filesystem encoding if lookups via `locale`
-        # fail, as this seems like a reasonable thing to do.
+        # Py_DecodeLocale(). The inverse of this is Py_EncodeLocale(), which
+        # isn't directly callable from Python code. In practice, os.fsencode()
+        # can be used instead (this is recommended by Python's documentation
+        # for sys.argv).
         #
         # On Windows, the wchar_t **argv is passed into the interpreter as-is.
         # Like POSIX, we need to emulate what Py_EncodeLocale() would do. But
         # there's an additional wrinkle. What we really want to access is the
         # ANSI codepage representation of the arguments, as this is what
@@ -176,23 +169,11 @@
         # encoding, which will pass CP_ACP to the underlying Windows API to
         # produce bytes.
         if os.name == r'nt':
             sysargv = [a.encode("mbcs", "ignore") for a in sys.argv]
         else:
-
-            def getdefaultlocale_if_known():
-                try:
-                    return locale.getdefaultlocale()
-                except ValueError:
-                    return None, None
-
-            encoding = (
-                locale.getlocale()[1]
-                or getdefaultlocale_if_known()[1]
-                or sys.getfilesystemencoding()
-            )
-            sysargv = [a.encode(encoding, "surrogateescape") for a in sys.argv]
+            sysargv = [fsencode(a) for a in sys.argv]
 
     bytechr = struct.Struct('>B').pack
     byterepr = b'%r'.__mod__
 
     class bytestr(bytes):
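
For comparison, the two POSIX strategies touched by this change can be sketched side by side. This is a simplified illustration under stated assumptions, not the Mercurial code itself: the function names encode_argv_locale and encode_argv_fs are hypothetical, the getdefaultlocale() fallback of the removed code is omitted, and the bare fsencode used in the diff is assumed to behave like os.fsencode().

```python
import locale
import os
import sys


def encode_argv_locale(args):
    """Simplified stand-in for the removed code: use the locale's encoding."""
    # Assumption: the getdefaultlocale() fallback of the original is left out,
    # so only locale.getlocale() and the filesystem encoding are consulted.
    encoding = locale.getlocale()[1] or sys.getfilesystemencoding()
    return [a.encode(encoding, "surrogateescape") for a in args]


def encode_argv_fs(args):
    """Stand-in for the new code: defer to os.fsencode()."""
    return [os.fsencode(a) for a in args]


if __name__ == "__main__":
    # Hypothetical command line; pure ASCII, so both strategies should agree
    # under any ASCII-compatible encoding.
    args = ["hg", "log", "-r", "tip"]
    print(encode_argv_locale(args) == encode_argv_fs(args))
```

The two functions should only diverge when the locale encoding and the filesystem encoding differ, for example under Python's UTF-8 mode with a non-UTF-8 locale, which is the case the commit description singles out.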