Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PyYAML 5.3 not compatible with Jython #369

Closed
pekkaklarck opened this issue Jan 7, 2020 · 11 comments · Fixed by #378
Closed

PyYAML 5.3 not compatible with Jython #369

pekkaklarck opened this issue Jan 7, 2020 · 11 comments · Fixed by #378
Labels

Comments

@pekkaklarck
Copy link

PyYAML 5.2 still worked but 5.3 crashes at import. Tested with Jython 2.7.0 and 2.7.2b2 and this is the result:

$ jython -c "import yaml"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/peke/Prog/jython2.7.0/Lib/site-packages/yaml/__init__.py", line 8, in <module>
    from loader import *
  File "/home/peke/Prog/jython2.7.0/Lib/site-packages/yaml/loader.py", line 4, in <module>
    from reader import *
UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 49-55: illegal Unicode character

This seems to be caused by PR #351. Apparently the root cause is that Jython doesn't support lone surrogates at all. According to https://bugs.jython.org/issue2048 that is by design.

@perlpunk
Copy link
Member

perlpunk commented Jan 7, 2020

Thanks. It's not yet clear to me where it fails. So already importing yaml fails? and on which line does it fail? The error message only talks about "position 49-55".

@pekkaklarck
Copy link
Author

It fails at import. Jython considers lone surrogates like u'\uD800' invalid, same as there being a syntax error in the code.

Not sure why the error message doesn't contain line information and not sure is the position correct either. The traceback shows that the problem is in the reader module, though, and only change there since v5.2 seems to be introduced by PR #351. I also already tested that reverting the changed line fixes the error.

@perlpunk
Copy link
Member

perlpunk commented Jan 7, 2020

@anishathalye do you have any idea?

@perlpunk
Copy link
Member

perlpunk commented Jan 7, 2020

Jython considers lone surrogates like u'\uD800' invalid

Hm, that's a bit weird, considered that the usage here is in a regex that actually wants to avoid invalid unicode.
How would one test input for things like this if it's forbidden to use this in the source code?
I wonder how the json module is doing this, as I assume it has similar tests.
Note that I'm not very experienced in python.

@pekkaklarck
Copy link
Author

The Jython issue I referred to earlier explains their reasoning why lone surrogates aren't supported. I think it also has something to do with how JVM works.

I'm not sure is it possible to construct this regexp so that it would work also with Jython. If not, it is possible to have different regexp for Jython and others. Unfortunately Jython not liking lone surrogates at all means that the latter pattern cannot be in the source code directly. One possibility to handle that is using eval like in this example:

if has_ucs4:
    NON_PRINTABLE = u'[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD\U00010000-\U0010ffff]'
elif sys.platform.startswith('java'):
    # Jython doesn't support lone surrogates https://bugs.jython.org/issue2048 
    NON_PRINTABLE = u'[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]'
else:
    # Need to use eval here due to the above Jython issue
    NON_PRINTABLE = eval(r"u'[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uFFFD]|(?:^|[^\uD800-\uDBFF])[\uDC00-\uDFFF]|[\uD800-\uDBFF](?:[^\uDC00-\uDFFF]|$)'")
NON_PRINTABLE = re.compile(NON_PRINTABLE)

@perlpunk
Copy link
Member

perlpunk commented Jan 7, 2020

@pekkaklarck how about moving the regex to an extra file and import it depending on has_ucs4 and platform?

OTOH I think there is an existing PR #124 rewriting this check without regex. maybe that could work, but it has to be updated.

@pekkaklarck
Copy link
Author

pekkaklarck commented Jan 7, 2020

Having regexps in own modules that are imported based on needs would work too. I've used that trick with our project to hide differences between Python 2, Python 3, PyPy, Jython and IronPython when there has been more code. If there has been problem in just one line, I've typically used eval or exec.

Not needing that NON_PRINTABLE regexp at allo would obviously solve the problem nicely.

@perlpunk
Copy link
Member

perlpunk commented Jan 7, 2020

I think in this case I would actually be ok with the eval.
@pekkaklarck would you want to do a PR with the code you showed?

@pekkaklarck
Copy link
Author

I can do that but won't have time in the near future. I won't mind you or someone else taking care this. Would be a good first issue for someone new interested to contribute to open source.

@anishathalye
Copy link
Contributor

I think NON_PRINTABLE is necessary if you want to point to the specific place where there's an issue; if you don't care about providing as good error messages, you could invert the regex, have a PRINTABLE regex instead (which will avoid the lone surrogate issue), but then you won't be able to point to the specific place in the string where there is a non-printable character.

ingydotnet pushed a commit that referenced this issue Jan 13, 2021
This patch was taken from
#369 (comment),
authored by Pekka Klärck <[email protected]>.

In short, Jython doesn't support lone surrogates, so importing yaml (and
in particular, loading `reader.py`) caused a UnicodeDecodeError. This
patch works around this through a clever use of `eval` to defer
evaluation of the string containing the lone surrogates, only doing it
on non-Jython platforms.

This is only done in `lib/yaml/reader.py` and not `lib3/yaml/reader.py`
because Jython does not support Python 3.

With this patch, Jython's behavior with respect to Unicode code points
over 0xFFFF becomes as it was before
0716ae2. It still does not pass all the
unit tests on Jython (passes 1275, fails 3, errors on 1); all the
failing tests are related to unicode. Still, this is better than simply
crashing upon `import yaml`.

With this patch, all tests continue to pass on Python 2 / Python 3.
@perlpunk
Copy link
Member

Fixed by #378 in 5.4

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants