Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Another attempt to fix encoding issues #4486

Merged
merged 10 commits into from
May 27, 2017
Merged

Another attempt to fix encoding issues #4486

merged 10 commits into from
May 27, 2017

Conversation

pfmoore
Copy link
Member

@pfmoore pfmoore commented May 16, 2017

One more attempt to fix the various encoding issues we see. The main thing here is that we use the errors="replace" parameter when decoding subprocess output, so we should hopefully avoid many of the Unicode errors people are seeing.

There's also an attempt to choose (and document) a better encoding that we expect from build tools. There are known issues with this approach (see the comments under #4280) but it should at least improve things.

Partial fix for #4110, #4003, #4212, Note that this PR does not address any issues that are reported as happening on Python 2, this only changes Python 3 behaviour.

@pfmoore
Copy link
Member Author

pfmoore commented May 16, 2017

@pypa/pip-committers I'd appreciate reviews of this, if possible. I have no easy means of testing on non-English systems (Unix or Windows), and I know our tests don't exercise situations like that, so this change is at best theoretically OK. I don't think we have any committers using non-English systems, but if anyone can give it a once-over that would be great.

@pfmoore pfmoore changed the title Another attempt to fix encoding issues [WIP] Another attempt to fix encoding issues May 20, 2017
@pfmoore
Copy link
Member Author

pfmoore commented May 20, 2017

Made this WIP. See discussions on distutils-sig under "PEP 517 - specifying build system in pyproject.toml" and specifically https://mail.python.org/pipermail/distutils-sig/2017-May/030442.html. In summary, Visual C can produce output with an inconsistent (mixed) encoding, so trying to deal with that (beyond simply not crashing) is pointless.

Copy link
Member

@xavfernandez xavfernandez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pip/compat.py Outdated
return s.decode(sys.__stdout__.encoding)
except UnicodeDecodeError:
return s.decode('utf_8')
return s.decode(subprocess_encoding(), errors="replace")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, I'd like pip to first try decoding without error handling and on UnicodeDecodeError switch to errors="replace" with a nice warning

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'll work on that.

@pfmoore
Copy link
Member Author

pfmoore commented May 20, 2017

One further thought. We need the output to be printable, so it must only use characters representable in sys.stdout.encoding. To make sure that's the case, I think we probably need another round of transcoding, to strip out any characters that can't be printed.

I believe the required dance is:

  1. Try decoding using the expected encoding.
  2. If this fails, print a warning that the data is encoded incorrectly, and then decode with errors=replace.
  3. Re-encode in sys.stdout.encoding with errors=replace, to ensure that no non-printable characters are included.
  4. Decode back to Unicode using sys.stdout.encoding.

Result is a Unicode string that's a "best possible" representation of the subprocess output, assuming the expected encoding, using only characters that can be printed to sys.stdout without error.

At the moment, we're only doing this for Python 3. But for Python 2 it's still possible for the output to contain bytes that aren't valid in sys.stdout.encoding. Should we therefore do the same thing for Python 2? It'll need to be modified a bit to deal with the nasty Python 2 string model, but I think we do need to clean the data so it doesn't include characters that can't be printed without error.

@pfmoore
Copy link
Member Author

pfmoore commented May 22, 2017

Just pushed an update that more robustly tidies up subprocess output to avoid encoding errors. Includes @xavfernandez suggestion to warn if we can't use the expected encoding without loss, and chooses a default encoding based on discussions on distutils-sig.

Also, this version sanitises the data even on Python 2, as it's possible to get Unicode errors on output there, as well.

Reviews appreciated, but I'm intending to get this change in for the next version of pip, as even if we get the encoding details slightly wrong, switching to errors=replace is better regardless. Backslashreplace would be better still, but that's only allowed for decoding since Python 3.5. I guess I could do something like error_strategy = "backslashreplace" if sys.version_info > (3, 5) else "replace". What do people think? Is it worth it?

@dstufft
Copy link
Member

dstufft commented May 22, 2017

I could do something like error_strategy = "backslashreplace" if sys.version_info > (3, 5) else "replace". What do people think? Is it worth it?

That seems reasonable to me, though I'd maybe ask to put it into the compat module to keep the "random crap we can maybe delete as we drop support for Py versions" as contained to one module as we can.

@ncoghlan
Copy link
Member

ncoghlan commented May 23, 2017

Regarding backslashreplace as an error handler: you can use that safely for the encoding step regardless of version. The only point where it needs to be conditional is on the initial decoding step, as the change made in 3.5 was to add support for handling UnicodeDecodeError in addition to the existing handling of UnicodeEncodeError.

For 3.4 and earlier, rather than falling back to replace, you also have the option of registering a custom backslashreplace_decode error handler based on the examples in https://stackoverflow.com/questions/25442954/how-should-i-decode-bytes-using-ascii-without-losing-any-junk-bytes-if-xmlch/25443356#25443356

Something like:

    def backslashreplace_decode(err):
        raw_bytes = (ord(err.object[i]) for i in range(err.start, err.end))
        return u"".join(u"\\x%x" % c for c in raw_bytes), err.end

>>> codecs.register_error("backslashreplace_decode", backslashreplace_decode)
>>> b'\xd3PS-90AC'.decode("ascii", "backslashreplace_decode")
u'\\xd3PS-90AC'

requires that the output is written in a well-defined encoding, specifically
the encoding returned by python's ``locale.getpreferredencoding`` function, or
"utf8" if ``getpreferredencoding`` does not return a value (or returns "ASCII",
which ).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't a complete sentence... :/

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meh. Teach me to rush the doc update :-( I'll revise the docs to fix this and take into account Nick's comments.

requires that the output is written in a well-defined encoding, specifically
the encoding returned by python's ``locale.getpreferredencoding`` function, or
"utf8" if ``getpreferredencoding`` does not return a value (or returns "ASCII",
which ).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The most reliable incantation I know for detecting ASCII as the preferred encoding is: codecs.lookup(locale.getpreferredencoding()).name == "ascii"

The reason for that is that going through the codec system and then looking at the codec name automatically deals with the fact that different platforms may report the same encoding under different aliases. Most importantly for this purpose, some Linux systems report ASCII by its official spec number, 'ANSI_X3.4-1968'.

(The patch itself handles this correctly, but it makes sense to include it here for the benefit of build tool developers as well)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need that much detail, I'll reword to make the intent clear (that we use UTF-8 if the locale encoding is ASCII).

produce output in the correct encoding. In practice - and in particular
on Windows, where tools are inconsistent in their use of the "OEM" and
"ANSI" codepages - this may not always be possible, so pip will attempt to
recover cleanly if presented with incorrectly encoded build tool output.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the backslashreplace discussions, perhaps append ", by translating unexpected byte sequences to Python-style hexadecimal escape sequences (\x80\xff, etc)."

recover cleanly if presented with incorrectly encoded build tool output.
However, pip cannot guarantee in that case that the displayed output will
not be corrupted (mojibake, or characters replaced with the standard
replacement character, often a question mark).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The backslashreplace adjustment should be reliable enough that it makes sense to drop this sentence (it's technically still true, but the cases where even backslashreplace is insufficient are obscure enough that I think including it would create more confusion than it clears up)

news/4486.bugfix Outdated
@@ -0,0 +1 @@
Improve handling of Unicode output from build tools under Python 3.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the discussions, "Improve handling of improperly encoded text output from build tools" would be a more accurate description now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm leaving out the phrase "improperly encoded", as it's mostly text that is properly encoded, just not in the same encoding that we're assuming. (There's no standard yet for what encoding setuptools should be using for output).

pip/compat.py Outdated
logger.warning(
"Subprocess output does not appear to be encoded as %s" %
encoding)
s = data.decode(encoding, errors="replace")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the step that would need to conditionally use the native backslashreplace on 3.5+, and a pip-provided emulation otherwise.

pip/compat.py Outdated
# that won't fail).
output_encoding = sys.__stderr__.encoding
if output_encoding:
s = s.encode(output_encoding, errors="replace")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This step can unconditionally use backslashreplace

@pfmoore
Copy link
Member Author

pfmoore commented May 23, 2017

For 3.4 and earlier, rather than falling back to replace, you also have the option of registering a custom backslashreplace_decode error handler

That looks like a good option - I knew registering a handler was possible, but hadn't looked into how difficult it was. Using the built in handler for 3.5+ and registering our own implementation for older versions seems like a good way to go.

@pfmoore
Copy link
Member Author

pfmoore commented May 23, 2017

OK, I think I've dealt with all the review comments. @ncoghlan could you check if you're OK with the reworded docs in particular?

Copy link
Member

@ncoghlan ncoghlan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty good to me. Just the one actual bug (the version check), and a couple of minor comments inline.

requires that the output is written in a well-defined encoding, specifically
the encoding the user has configured for text output (which can be obtained in
Python using ``locale.getpreferredencoding``). If the configured encoding is
ASCII, pip assumes UTF-8 (to match the behaviour of some Unix systems).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not so much about "matching the behaviour of" as "accounting for the misbehaviour of" :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I'll go for "accounting for the behaviour of" (don't want to be accused of being the Windows guy criticising Unix ;-))

this may not always be possible. Pip will therefore attempt to recover cleanly
if presented with incorrectly encoded build tool output, by translating
unexpected byte sequences to Python-style hexadecimal escape sequences
(\x80\xff, etc). However, it is still possible for output to be displayed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps show this as an inline Python string literal? That is:

``"\x80\xff"``

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

pip/compat.py Outdated
return s.decode(sys.__stdout__.encoding)
except UnicodeDecodeError:
return s.decode('utf_8')
if sys.version_info > (3, 4):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check will succeed for 3.4.x releases, which isn't what you want. sys.version_info >= (3, 5) will do the right thing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thanks! I'll fix this.

pip/compat.py Outdated
if sys.version_info > (3, 4):
backslashreplace_decode = "backslashreplace"
else:
def backslashreplace_decode_fn(err):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably worth adding a comment here to say that while backslashreplace exists on these earlier versions, it's only usable for encoding, so this implements a version that specifically handles decoding.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

@pfmoore
Copy link
Member Author

pfmoore commented May 24, 2017

Thanks for the review @ncoghlan - I'll update the PR (and fix the pep8 failures) this evening.

@pfmoore pfmoore changed the title [WIP] Another attempt to fix encoding issues Another attempt to fix encoding issues May 24, 2017
@pfmoore pfmoore force-pushed the encoding branch 2 times, most recently from 19ab087 to 1cf7e84 Compare May 24, 2017 13:25
@pfmoore
Copy link
Member Author

pfmoore commented May 24, 2017

OK, I'm going to try to add some unit tests for console_to_str (specifically, just to exercise it to prove that it doesn't fail given bad input) - mostly because there was a typo in my code that the existing tests weren't catching.

Once I've done that, if the tests don't expose any more bugs, I'm going to merge this - if anyone has any further comments before then, just shout.

@pypa-bot
Copy link

Hello!

I am an automated bot and I have noticed that this pull request is not currently able to be merged. If you are able to either merge the master branch into this pull request or rebase this pull request against master then it will eligible for code review and hopefully merging!

@pypa-bot pypa-bot added the needs rebase or merge PR has conflicts with current master label May 25, 2017
@pypa-bot pypa-bot removed the needs rebase or merge PR has conflicts with current master label May 25, 2017
@shivan
Copy link

shivan commented Jun 2, 2017

I found a workaround until pip 10 will be available.

enter
chcp 65001
in the commandline before installing package via pip

This will set UTF-8 codepage.

Is there a plan, when this fix will be available to public? Milestone is not set here.

@dsudduth
Copy link

dsudduth commented Jun 2, 2017

@pfmoore - Can you share the latest status? I'd like to give this a shot to see if this fixes my issue on Russian Operating Systems.

@pfmoore
Copy link
Member Author

pfmoore commented Jun 2, 2017

You can pull the latest development version from git, and try that. pip install https://github.com/pypa/pip should also work.


monkeypatch.setattr(locale, 'getpreferredencoding', lambda: 'utf-8')
monkeypatch.setattr(pip.compat.logger, 'warning', check_warning)
console_to_str(some_bytes)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it should be asserted that this monkey patched function is called?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could do I guess. The idea of the monkeypatching is to fix the behaviour rather than to test those functions are called, but it wouldn't do any harm.

@lock
Copy link

lock bot commented Jun 2, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot added the auto-locked Outdated issues that have been locked by automation label Jun 2, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Jun 2, 2019
@pfmoore pfmoore deleted the encoding branch January 24, 2021 11:13
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
auto-locked Outdated issues that have been locked by automation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants