LookupError: unknown encoding: utf16-le #6054

hroncok · 2018-11-30T10:28:31Z

Environment

pip version: 18.1
Python version: 3.7.1
OS: Fedora 30 s390x

This bug manifests itself when the tests are run on a big endian architecture.
However, it can be examined on little endian as well.

Description

This is the test failure on s390x:

=================================== FAILURES ===================================
____________________ TestEncoding.test_auto_decode_utf16_le ____________________
self = <tests.unit.test_utils.TestEncoding object at 0x3ff9cb5b5c0>
    def test_auto_decode_utf16_le(self):
        data = (
            b'\xff\xfeD\x00j\x00a\x00n\x00g\x00o\x00=\x00'
            b'=\x001\x00.\x004\x00.\x002\x00'
        )
>       assert auto_decode(data) == "Django==1.4.2"
tests/unit/test_utils.py:459: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
data = '\xff\xfeD\x00j\x00a\x00n\x00g\x00o\x00=\x00=\x001\x00.\x004\x00.\x002\x00'
    def auto_decode(data):
        """Check a bytes string for a BOM to correctly detect the encoding
    
        Fallback to locale.getpreferredencoding(False) like open() on Python3"""
        for bom, encoding in BOMS:
            if data.startswith(bom):
>               return data[len(bom):].decode(encoding)
E               LookupError: unknown encoding: utf16-le
src/pip/_internal/utils/encoding.py:25: LookupError

Expected behavior

The tests should pass on all architectures alike.

How to Reproduce

Get a big endian machine (virtualize maybe?)
Run the tests.

More info

I've checked and pip has:

pip/src/pip/_internal/utils/encoding.py

Lines 6 to 14 in e5ab7f6

    
    BOMS = [
        (codecs.BOM_UTF8, 'utf8'),
        (codecs.BOM_UTF16, 'utf16'),
        (codecs.BOM_UTF16_BE, 'utf16-be'),
        (codecs.BOM_UTF16_LE, 'utf16-le'),
        (codecs.BOM_UTF32, 'utf32'),
        (codecs.BOM_UTF32_BE, 'utf32-be'),
        (codecs.BOM_UTF32_LE, 'utf32-le'),
    ]

And:

pip/src/pip/_internal/utils/encoding.py

Lines 23 to 25 in e5ab7f6

    
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)

So this has 2 problems:

Why does this fail on a big endian architecture and not on all architectures?
pip tries to use nonexistent encodings.

I have a small reproducer here (run on my machine, x86_64):

>>> from pip._internal.utils.encoding import BOMS
>>> for bom, encoding in BOMS:
...     print(bom, encoding, end=': ')
...     try:
...         _ = ''.encode(encoding)
...         print('ok')
...     except Exception as e:
...         print(type(e), e)
... 
b'\xef\xbb\xbf' utf8: ok
b'\xff\xfe' utf16: ok
b'\xfe\xff' utf16-be: <class 'LookupError'> unknown encoding: utf16-be
b'\xff\xfe' utf16-le: <class 'LookupError'> unknown encoding: utf16-le
b'\xff\xfe\x00\x00' utf32: ok
b'\x00\x00\xfe\xff' utf32-be: <class 'LookupError'> unknown encoding: utf32-be
b'\xff\xfe\x00\x00' utf32-le: <class 'LookupError'> unknown encoding: utf32-le

This is the output on s390x:

b'\xef\xbb\xbf' utf8: ok
b'\xfe\xff' utf16: ok
b'\xfe\xff' utf16-be: <class 'LookupError'> unknown encoding: utf16-be
b'\xff\xfe' utf16-le: <class 'LookupError'> unknown encoding: utf16-le
b'\x00\x00\xfe\xff' utf32: ok
b'\x00\x00\xfe\xff' utf32-be: <class 'LookupError'> unknown encoding: utf32-be
b'\xff\xfe\x00\x00' utf32-le: <class 'LookupError'> unknown encoding: utf32-le

Clearly we see that the utf16-be, utf16-le, utf32-be and utf32-le encodings cannot be used at all.
Is that expected? Is the code never supposed to reach those anyway?

The testing bytestring is:

b'\xff\xfeD\x00j\x00a\x00n\x00g\x00o\x00=\x00=\x001\x00.\x004\x00.\x002\x00'

It starts with \xff\xfe and hence should be decoded by the first encoding that has this BOM. On little endian, that is utf16: everything works, because we never reach the nonexistent encodings.

However, on a big endian system the utf16 BOM is big endian, so the first item with the \xff\xfe BOM is utf16-le, and it blows up.
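The ordering effect described above can be sketched without access to a big endian machine. This is only an illustration, not pip's code: `first_matching_encoding` and the two tables are hypothetical names, and the tables simply hard-code what `codecs.BOM_UTF16` evaluates to on each byte order.

```python
import codecs
import sys

# codecs.BOM_UTF16 is the BOM in the machine's native byte order.
NATIVE_BOM = codecs.BOM_UTF16_LE if sys.byteorder == "little" else codecs.BOM_UTF16_BE
assert codecs.BOM_UTF16 == NATIVE_BOM

# Hypothetical stand-ins for the UTF-16 entries of the BOMS list,
# with codecs.BOM_UTF16 spelled out for each byte order.
BOMS_ON_LITTLE_ENDIAN = [
    (codecs.BOM_UTF16_LE, "utf16"),     # what codecs.BOM_UTF16 is on x86_64
    (codecs.BOM_UTF16_BE, "utf16-be"),
    (codecs.BOM_UTF16_LE, "utf16-le"),
]
BOMS_ON_BIG_ENDIAN = [
    (codecs.BOM_UTF16_BE, "utf16"),     # what codecs.BOM_UTF16 is on s390x
    (codecs.BOM_UTF16_BE, "utf16-be"),
    (codecs.BOM_UTF16_LE, "utf16-le"),
]


def first_matching_encoding(data, boms):
    """Return the encoding name of the first entry whose BOM prefixes data."""
    for bom, encoding in boms:
        if data.startswith(bom):
            return encoding
    return None


data = b'\xff\xfeD\x00j\x00'  # little endian BOM, as in the failing test
print(first_matching_encoding(data, BOMS_ON_LITTLE_ENDIAN))  # utf16
print(first_matching_encoding(data, BOMS_ON_BIG_ENDIAN))     # utf16-le
```

On little endian the valid name utf16 shadows the invalid utf16-le entry, which is why the bug stayed hidden there.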

To reproduce this problem on little endian architectures, add a test_auto_decode_utf16_be test with:

    def test_auto_decode_utf16_be(self):
        # Big endian BOM followed by big endian code units
        data = (
            b'\xfe\xff\x00D\x00j\x00a\x00n\x00g\x00o\x00='
            b'\x00=\x001\x00.\x004\x00.\x002'
        )
        assert auto_decode(data) == "Django==1.4.2"

>>> data = (
...     b'\xfe\xffD\x00j\x00a\x00n\x00g\x00o\x00=\x00'
...     b'=\x001\x00.\x004\x00.\x002\x00'
... )
>>> from pip._internal.utils.encoding import auto_decode
>>> auto_decode(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.7/site-packages/pip/_internal/utils/encoding.py", line 25, in auto_decode
    return data[len(bom):].decode(encoding)
LookupError: unknown encoding: utf16-be


hroncok · 2019-02-28T12:16:36Z

Still happens on 19.x.

cjerdonek · 2019-03-01T14:18:02Z

It looks like that code was added in PR #3485. @xavfernandez, can you take a look?

cjerdonek · 2019-03-01T14:22:23Z

It looks like the fix might be as simple as changing utf16-be to utf-16-be and similarly for the others.

There should be a regression test to iterate over the BOMS list and check that its entries are valid.
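A minimal sketch of what such a fix and regression test could look like. This is not the code from the eventual PR: `FIXED_BOMS` and `test_boms_encodings_are_valid` are hypothetical names, and the hyphenated spellings follow the suggestion in this comment.

```python
import codecs

# Hypothetical corrected table: same BOMs, hyphenated codec names.
FIXED_BOMS = [
    (codecs.BOM_UTF8, 'utf-8'),
    (codecs.BOM_UTF16, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF32, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
]


def test_boms_encodings_are_valid():
    """Every encoding name in the table must be known to the codec registry."""
    for bom, encoding in FIXED_BOMS:
        # codecs.lookup() raises LookupError for an unknown encoding name,
        # which would fail the test regardless of the machine's byte order.
        codecs.lookup(encoding)


test_boms_encodings_are_valid()
```

Because the check looks up every entry rather than decoding one sample, it does not depend on which entry happens to match first on a given architecture.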

hroncok · 2019-03-01T14:25:16Z

Indeed, utf-16-be seems to exist.

hroncok · 2019-03-01T14:27:24Z

I'll submit a PR with the fix and regression test.

hroncok · 2019-03-01T14:55:26Z

#6311

pfmoore · 2019-03-01T15:14:46Z

The table of aliases here would seem to confirm that utf16-be isn't a valid alias (although utf-16be is...)
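The same check can be made against the standard library's alias table directly. This is a sketch assuming the `encodings.aliases.aliases` dictionary, whose keys are stored in normalized form (lowercase, with hyphens replaced by underscores); the exact set of aliases may differ between Python versions.

```python
import codecs
from encodings.aliases import aliases

# 'utf-16be' normalizes to 'utf_16be', which is listed as an alias.
print(aliases.get('utf_16be'))

# 'utf16-be' normalizes to 'utf16_be'; this prints None if it is not listed.
print(aliases.get('utf16_be'))

# The alias resolves to the canonical codec.
print(codecs.lookup('utf-16be').name)
```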

utils.encoding.auto_decode() was broken when decoding Big Endian BOM byte-strings on Little Endian or vice versa. The TestEncoding.test_auto_decode_utf_16_le test was failing on Big Endian systems, such as Fedora's s390x builders. A similar test, but with BE BOM test_auto_decode_utf_16_be was added in order to reproduce this on a Little Endian system (which is much easier to come by). A regression test was added to check that all listed encodings in utils.encoding.BOMS are valid. Fixes pypa#6054

lock · 2019-05-28T19:29:19Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

pradyunsg added the S: needs triage, type: bug, and C: encoding labels Dec 14, 2018

hroncok mentioned this issue Mar 1, 2019

Fix utils.encoding.auto_decode() LookupError with invalid encodings #6311

Merged

cjerdonek removed the S: needs triage label Mar 1, 2019

cjerdonek closed this as completed in #6311 Mar 1, 2019

lock bot added the auto-locked label May 28, 2019

lock bot locked as resolved and limited conversation to collaborators May 28, 2019
