-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Another attempt to fix encoding issues #4486
Conversation
@pypa/pip-committers I'd appreciate reviews of this, if possible. I have no easy means of testing on non-English systems (Unix or Windows), and I know our tests don't exercise situations like that, so this change is at best theoretically OK. I don't think we have any committers using non-English systems, but if anyone can give it a once-over that would be great. |
Made this WIP. See discussions on distutils-sig under "PEP 517 - specifying build system in pyproject.toml" and specifically https://mail.python.org/pipermail/distutils-sig/2017-May/030442.html. In summary, Visual C can produce output with an inconsistent (mixed) encoding, so trying to deal with that (beyond simply not crashing) is pointless. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would also be nice to incorporate something like https://github.com/danilaml/pip/blob/test-non-ascii-user/appveyor.yml // http://help.appveyor.com/discussions/questions/3530-is-it-possible-to-somehow-specify-the-usernamehostname-with-which-ci-tests-are-run to check for future regressions
pip/compat.py
Outdated
return s.decode(sys.__stdout__.encoding) | ||
except UnicodeDecodeError: | ||
return s.decode('utf_8') | ||
return s.decode(subprocess_encoding(), errors="replace") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, I'd like pip to first try decoding without error handling and on UnicodeDecodeError
switch to errors="replace"
with a nice warning
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I'll work on that.
One further thought. We need the output to be printable, so it must only use characters representable in I believe the required dance is:
Result is a Unicode string that's a "best possible" representation of the subprocess output, assuming the expected encoding, using only characters that can be printed to At the moment, we're only doing this for Python 3. But for Python 2 it's still possible for the output to contain bytes that aren't valid in |
Just pushed an update that more robustly tidies up subprocess output to avoid encoding errors. Includes @xavfernandez suggestion to warn if we can't use the expected encoding without loss, and chooses a default encoding based on discussions on distutils-sig. Also, this version sanitises the data even on Python 2, as it's possible to get Unicode errors on output there, as well. Reviews appreciated, but I'm intending to get this change in for the next version of pip, as even if we get the encoding details slightly wrong, switching to |
That seems reasonable to me, though I'd maybe ask to put it into the compat module to keep the "random crap we can maybe delete as we drop support for Py versions" as contained to one module as we can. |
Regarding For 3.4 and earlier, rather than falling back to Something like:
|
docs/reference/pip.rst
Outdated
requires that the output is written in a well-defined encoding, specifically | ||
the encoding returned by python's ``locale.getpreferredencoding`` function, or | ||
"utf8" if ``getpreferredencoding`` does not return a value (or returns "ASCII", | ||
which ). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't a complete sentence... :/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Meh. Teach me to rush the doc update :-( I'll revise the docs to fix this and take into account Nick's comments.
docs/reference/pip.rst
Outdated
requires that the output is written in a well-defined encoding, specifically | ||
the encoding returned by python's ``locale.getpreferredencoding`` function, or | ||
"utf8" if ``getpreferredencoding`` does not return a value (or returns "ASCII", | ||
which ). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The most reliable incantation I know for detecting ASCII as the preferred encoding is: codecs.lookup(locale.getpreferredencoding()).name == "ascii"
The reason for that is that going through the codec system and then looking at the codec name automatically deals with the fact that different platforms may report the same encoding under different aliases. Most importantly for this purpose, some Linux systems report ASCII by its official spec number, 'ANSI_X3.4-1968'.
(The patch itself handles this correctly, but it makes sense to include it here for the benefit of build tool developers as well)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need that much detail, I'll reword to make the intent clear (that we use UTF-8 if the locale encoding is ASCII).
docs/reference/pip.rst
Outdated
produce output in the correct encoding. In practice - and in particular | ||
on Windows, where tools are inconsistent in their use of the "OEM" and | ||
"ANSI" codepages - this may not always be possible, so pip will attempt to | ||
recover cleanly if presented with incorrectly encoded build tool output. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the backslashreplace
discussions, perhaps append ", by translating unexpected byte sequences to Python-style hexadecimal escape sequences (\x80\xff
, etc)."
docs/reference/pip.rst
Outdated
recover cleanly if presented with incorrectly encoded build tool output. | ||
However, pip cannot guarantee in that case that the displayed output will | ||
not be corrupted (mojibake, or characters replaced with the standard | ||
replacement character, often a question mark). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The backslashreplace adjustment should be reliable enough that it makes sense to drop this sentence (it's technically still true, but the cases where even backslashreplace is insufficient are obscure enough that I think including it would create more confusion than it clears up)
news/4486.bugfix
Outdated
@@ -0,0 +1 @@ | |||
Improve handling of Unicode output from build tools under Python 3. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the discussions, "Improve handling of improperly encoded text output from build tools" would be a more accurate description now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm leaving out the phrase "improperly encoded", as it's mostly text that is properly encoded, just not in the same encoding that we're assuming. (There's no standard yet for what encoding setuptools should be using for output).
pip/compat.py
Outdated
logger.warning( | ||
"Subprocess output does not appear to be encoded as %s" % | ||
encoding) | ||
s = data.decode(encoding, errors="replace") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the step that would need to conditionally use the native backslashreplace
on 3.5+, and a pip-provided emulation otherwise.
pip/compat.py
Outdated
# that won't fail). | ||
output_encoding = sys.__stderr__.encoding | ||
if output_encoding: | ||
s = s.encode(output_encoding, errors="replace") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This step can unconditionally use backslashreplace
That looks like a good option - I knew registering a handler was possible, but hadn't looked into how difficult it was. Using the built in handler for 3.5+ and registering our own implementation for older versions seems like a good way to go. |
OK, I think I've dealt with all the review comments. @ncoghlan could you check if you're OK with the reworded docs in particular? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks pretty good to me. Just the one actual bug (the version check), and a couple of minor comments inline.
docs/reference/pip.rst
Outdated
requires that the output is written in a well-defined encoding, specifically | ||
the encoding the user has configured for text output (which can be obtained in | ||
Python using ``locale.getpreferredencoding``). If the configured encoding is | ||
ASCII, pip assumes UTF-8 (to match the behaviour of some Unix systems). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not so much about "matching the behaviour of" as "accounting for the misbehaviour of" :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I'll go for "accounting for the behaviour of" (don't want to be accused of being the Windows guy criticising Unix ;-))
docs/reference/pip.rst
Outdated
this may not always be possible. Pip will therefore attempt to recover cleanly | ||
if presented with incorrectly encoded build tool output, by translating | ||
unexpected byte sequences to Python-style hexadecimal escape sequences | ||
(\x80\xff, etc). However, it is still possible for output to be displayed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps show this as an inline Python string literal? That is:
``"\x80\xff"``
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do
pip/compat.py
Outdated
return s.decode(sys.__stdout__.encoding) | ||
except UnicodeDecodeError: | ||
return s.decode('utf_8') | ||
if sys.version_info > (3, 4): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check will succeed for 3.4.x releases, which isn't what you want. sys.version_info >= (3, 5)
will do the right thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, thanks! I'll fix this.
pip/compat.py
Outdated
if sys.version_info > (3, 4): | ||
backslashreplace_decode = "backslashreplace" | ||
else: | ||
def backslashreplace_decode_fn(err): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably worth adding a comment here to say that while backslashreplace exists on these earlier versions, it's only usable for encoding, so this implements a version that specifically handles decoding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do
Thanks for the review @ncoghlan - I'll update the PR (and fix the pep8 failures) this evening. |
19ab087
to
1cf7e84
Compare
OK, I'm going to try to add some unit tests for Once I've done that, if the tests don't expose any more bugs, I'm going to merge this - if anyone has any further comments before then, just shout. |
Hello! I am an automated bot and I have noticed that this pull request is not currently able to be merged. If you are able to either merge the |
I found a workaround until pip 10 will be available. enter This will set UTF-8 codepage. Is there a plan, when this fix will be available to public? Milestone is not set here. |
@pfmoore - Can you share the latest status? I'd like to give this a shot to see if this fixes my issue on Russian Operating Systems. |
You can pull the latest development version from git, and try that. |
|
||
monkeypatch.setattr(locale, 'getpreferredencoding', lambda: 'utf-8') | ||
monkeypatch.setattr(pip.compat.logger, 'warning', check_warning) | ||
console_to_str(some_bytes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it should be asserted that this monkey patched function is called?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could do I guess. The idea of the monkeypatching is to fix the behaviour rather than to test those functions are called, but it wouldn't do any harm.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
One more attempt to fix the various encoding issues we see. The main thing here is that we use the
errors="replace"
parameter when decoding subprocess output, so we should hopefully avoid many of the Unicode errors people are seeing.There's also an attempt to choose (and document) a better encoding that we expect from build tools. There are known issues with this approach (see the comments under #4280) but it should at least improve things.
Partial fix for #4110, #4003, #4212, Note that this PR does not address any issues that are reported as happening on Python 2, this only changes Python 3 behaviour.