UnicodeDecodeError with 1.3.1 and python2.7 #816

bf · 2014-06-01T20:13:43Z

When running nosetests on failing tests whose output contains non-ASCII characters, I get the following error:

  File "/usr/bin/nosetests-2.7", line 9, in <module>
    load_entry_point('nose==1.3.1', 'console_scripts', 'nosetests-2.7')()
  File "/usr/lib/python2.7/site-packages/nose/core.py", line 121, in __init__
    **extra_args)
  File "/usr/lib/python2.7/unittest/main.py", line 95, in __init__
    self.runTests()
  File "/usr/lib/python2.7/site-packages/nose/core.py", line 207, in runTests
    result = self.testRunner.run(self.test)
  File "/usr/lib/python2.7/site-packages/nose/core.py", line 62, in run
    test(result)
  File "/usr/lib/python2.7/site-packages/nose/suite.py", line 176, in __call__
    return self.run(*arg, **kw)
  File "/usr/lib/python2.7/site-packages/nose/suite.py", line 223, in run
    test(orig)
  File "/usr/lib/python2.7/site-packages/nose/suite.py", line 176, in __call__
    return self.run(*arg, **kw)
  File "/usr/lib/python2.7/site-packages/nose/suite.py", line 223, in run
    test(orig)
  File "/usr/lib/python2.7/site-packages/nose/case.py", line 45, in __call__
    return self.run(*arg, **kwarg)
  File "/usr/lib/python2.7/site-packages/nose/case.py", line 138, in run
    result.addError(self, err)
  File "/usr/lib/python2.7/site-packages/nose/proxy.py", line 128, in addError
    formatted = plugins.formatError(self.test, err)
  File "/usr/lib/python2.7/site-packages/nose/plugins/manager.py", line 99, in __call__
    return self.call(*arg, **kw)
  File "/usr/lib/python2.7/site-packages/nose/plugins/manager.py", line 141, in chain
    result = meth(*arg, **kw)
  File "/usr/lib/python2.7/site-packages/nose/plugins/capture.py", line 74, in formatError
    test.capturedOutput = output = self.buffer
  File "/usr/lib/python2.7/site-packages/nose/plugins/capture.py", line 112, in _get_buffer
    return self._buf.getvalue()
  File "/usr/lib/python2.7/StringIO.py", line 271, in getvalue
    self.buf += ''.join(self.buflist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 308: ordinal not in range(128)


jszakmeister · 2014-07-25T09:15:13Z

Does this still exist in 1.3.3? Do you have a small test case that can reproduce this?

haavikko · 2015-03-17T10:54:34Z

I've experienced the same issue. Here's a minimal test case that triggers the problem for me.

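# -*- coding: utf-8 -*-
from unittest import TestCase  # import not shown in the original; assumed

class NoseEncodingTestCase(TestCase):
    def test_crash_nose(self):
        print u'äää'   # unicode output
        print '\xe4'   # non-ASCII 8-bit str output
        self.fail()    # the crash only shows up when the test fails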

So the problem is caused by mixing unicode and 8-bit str output in a test that fails.

StringIO code contains this comment:
The StringIO object can accept either Unicode or 8-bit strings,
but mixing the two may take some care. If both are used, 8-bit
strings that cannot be interpreted as 7-bit ASCII (that use the
8th bit) will cause a UnicodeError to be raised when getvalue()
is called.
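The failure that comment describes comes from Python 2 implicitly decoding 8-bit chunks with the default ASCII codec when they are joined with `unicode` chunks. Here is a minimal sketch that makes that implicit step explicit, so it behaves the same on Python 2 and Python 3. `join_like_py2_stringio` is a hypothetical helper written for illustration, not part of StringIO or nose:

```python
def join_like_py2_stringio(chunks):
    """Join mixed text/byte chunks the way Python 2 does implicitly:
    every byte chunk is first decoded as ASCII."""
    text_type = type(u"")
    decoded = []
    for chunk in chunks:
        if not isinstance(chunk, text_type):
            # This is the step that blows up on bytes such as 0xe4 or 0xc3.
            chunk = chunk.decode("ascii")
        decoded.append(chunk)
    return u"".join(decoded)


error = None
try:
    # Same mix as the test case: one unicode write, one non-ASCII 8-bit write.
    join_like_py2_stringio([u"\xe4\xe4\xe4\n", b"\xe4\n"])
except UnicodeDecodeError as exc:
    error = exc
```

Pure-ASCII byte chunks pass through this path without error, which is why the problem only appears once non-ASCII 8-bit output is mixed in.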

I'm using nose==1.3.4 and django-nose==1.2.

jszakmeister · 2015-03-17T11:21:05Z

What do you propose the solution to be? What should be captured? How should it be coerced? Yes, mixing non-unicode and unicode is a problem, but what do you think the correct behavior should be and why?

jszakmeister · 2015-03-17T11:21:42Z

BTW, thanks for the test case!

haavikko · 2015-03-17T14:24:13Z

I guess that a test case (especially a failing one) might write anything to stdout. Such output often comes from inside third-party code, so nose should be able to accept any mix of output.

I'm not very knowledgeable on this issue, but one (maybe totally unworkable) idea:

Implement a subclass of io.TextIOBase that uses io.BytesIO to store the data, and use it to wrap stdout.
Make the TextIOBase subclass accept both str and unicode instances. Unicode is encoded with sys.stdout.encoding before being added to the BytesIO.
When the contents need to be displayed, decode the buffer with sys.stdout.encoding, but use errors='replace' so that decoding doesn't choke on invalid byte sequences.
Maybe give the end user a way to configure the encoding/decoding method used.
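A rough sketch of how that idea might look, assuming Python 2.6+ or Python 3. The class name `MixedCapture` and its `encoding` argument are made up for illustration; this is not nose code:

```python
import io
import sys


class MixedCapture(io.TextIOBase):  # hypothetical name
    """Capture buffer that stores everything as bytes, so text and
    8-bit writes can be mixed freely."""

    def __init__(self, encoding=None):
        super(MixedCapture, self).__init__()
        # Fall back to UTF-8 when stdout has no usable encoding (assumption).
        self._encoding = (encoding
                          or getattr(sys.stdout, "encoding", None)
                          or "utf-8")
        self._bytes = io.BytesIO()

    def writable(self):
        return True

    def write(self, s):
        written = len(s)
        if isinstance(s, type(u"")):
            # Text is encoded before it is stored.
            s = s.encode(self._encoding, "replace")
        self._bytes.write(s)
        return written

    def getvalue(self):
        # errors='replace' turns invalid byte sequences into U+FFFD
        # instead of raising UnicodeDecodeError.
        return self._bytes.getvalue().decode(self._encoding, "replace")


capture = MixedCapture(encoding="utf-8")
capture.write(u"\xe4\xe4\xe4\n")   # text write
capture.write(b"\xe4\n")           # 8-bit write that is not valid UTF-8
captured = capture.getvalue()
```

The stray 0xe4 byte comes back as a replacement character rather than aborting the run, at the cost of losing the original byte.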

The io library was added in Python 2.6; if older versions need to be supported, the current StringIO solution can be kept for them.

jszakmeister · 2015-03-17T22:12:42Z

None of what you proposed works with Python 2.5 or 2.4. And I'm not real interested in maintaining yet another place where everything differs. Is there a solution that works across the board?

haavikko · 2015-03-18T20:15:30Z

The problem could be solved by implementing a subclass of StringIO that overrides just getvalue().
The idea is to preserve the current behavior, except that when a decoding error is detected, the contents of the buffer are forced into ASCII.

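Something like this (Python 2 only; the class name is a placeholder):

from StringIO import StringIO

class LenientStringIO(StringIO):
    def getvalue(self):
        try:
            # StringIO is an old-style class in Python 2, so no super() here.
            return StringIO.getvalue(self)
        except UnicodeDecodeError:
            # Mixed unicode/str chunks: force each one into ASCII,
            # replacing anything that does not fit with '?'.
            chunks = [self.buf] + self.buflist
            chunks = [c if isinstance(c, unicode) else c.decode('ascii', 'replace')
                      for c in chunks]
            return u''.join(chunks).encode('ascii', 'replace')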

This is not as clean or general a solution as using io.TextIOBase would be, but it should be workable in all Python versions. Note: I've only checked the Python 2.7 StringIO code; if the internal implementation of StringIO is very different in earlier Python versions, this will fail. I also don't know about Python 3; possibly this bug does not even occur there.

jszakmeister · 2015-03-18T23:25:26Z

Doesn't this approach corrupt the output?

haavikko · 2015-03-19T10:20:22Z

Yes, it does. If you have a list of buffers in who-knows-what encodings and you combine them into one string, that's what happens (but output containing some question marks is preferable to tests not running at all).

But there may be another option. The problem is caused by forcing a list of strings into the same encoding, so don't do that. Skip calling getvalue() altogether and implement another method that goes through the internal StringIO buffer list and prints each buffer in turn, with no need to force any encoding on them. Even then, depending on the value of sys.stdout.encoding, the output will sometimes be corrupted anyway, but I don't see a way around that.
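The "replay each buffer" idea could be sketched roughly as follows. To avoid depending on StringIO internals, this sketch keeps its own list of chunks; `ChunkCapture` and `Recorder` are hypothetical names used only for the demonstration:

```python
class ChunkCapture(object):  # hypothetical name
    """Capture buffer that never joins what it receives."""

    def __init__(self):
        self.chunks = []

    def write(self, chunk):
        # Store each write untouched, whether it is text or 8-bit data.
        self.chunks.append(chunk)

    def replay(self, stream):
        """Write every captured chunk to `stream`, one at a time."""
        for chunk in self.chunks:
            stream.write(chunk)


class Recorder(object):
    """Fake target stream for the demo; records each write call."""

    def __init__(self):
        self.writes = []

    def write(self, chunk):
        self.writes.append(chunk)


capture = ChunkCapture()
capture.write(u"\xe4\xe4\xe4\n")
capture.write(b"\xe4\n")

target = Recorder()
capture.replay(target)  # no join, so no implicit ASCII decoding happens
```

On Python 2, `sys.stdout` could be passed as the stream since it accepts both kinds of string. On Python 3, text streams reject bytes, so byte chunks would need separate handling there.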

On Python 2, `sys.stdout` and `print` can normally handle any combination of `str` and `unicode` objects. However, `StringIO.StringIO` can only safely handle one or the other. If the program writes both a `unicode` string and a non-ASCII `str` string, then the `getvalue()` method will fail with `UnicodeDecodeError` [1].

In nose, that causes the script to abort suddenly with the cryptic `UnicodeDecodeError`. This fix catches `UnicodeError` when trying to get the captured output, and replaces the captured output with a warning message.

Fixes nose-devs#816

[1] <https://github.com/python/cpython/blob/2.7/Lib/StringIO.py#L258>
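The approach in that commit message (catch `UnicodeError` and substitute a warning) might look something like the sketch below. The function name, the warning text, and the `BrokenBuffer` stand-in are all assumptions for illustration; the actual patch may differ:

```python
import io


def get_captured_output(buf):
    """Return the captured output, or a warning if it cannot be joined."""
    try:
        return buf.getvalue()
    except UnicodeError:
        # Hypothetical warning text; the real message is not shown in this thread.
        return u"<captured output unavailable: mixed unicode and non-ASCII str>"


class BrokenBuffer(object):
    """Stand-in for a StringIO whose getvalue() fails as in the traceback."""

    def getvalue(self):
        raise UnicodeDecodeError("ascii", b"\xc3", 0, 1,
                                 "ordinal not in range(128)")


ok_output = get_captured_output(io.StringIO(u"ok\n"))
fallback_output = get_captured_output(BrokenBuffer())
```

The test run continues either way; the cost is that the captured output for the affected test is lost and only the warning is shown.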

bf closed this as completed Mar 17, 2015

This was referenced Mar 23, 2016

Prevent crashing from UnicodeDecodeError #988

Open

Prevent crashing from UnicodeDecodeError nose-devs/nose2#288

Merged
