PyYAML 5.3 not compatible with Jython #369
Comments
Thanks. It's not yet clear to me where it fails. So importing yaml already fails? And on which line does it fail? The error message only talks about "position 49-55". |
It fails at import. Jython considers lone surrogates like Not sure why the error message doesn't contain line information, and I'm not sure whether the position is correct either. The traceback shows that the problem is in the |
@anishathalye do you have any idea? |
Hm, that's a bit weird, considering that the usage here is in a regex that actually wants to avoid invalid unicode. |
The Jython issue I referred to earlier explains their reasoning for why lone surrogates aren't supported. I think it also has something to do with how the JVM works. I'm not sure whether it is possible to construct this regexp so that it would also work with Jython. If not, it is possible to have a different regexp for Jython than for the other platforms. Unfortunately, Jython not liking lone surrogates at all means that the latter pattern cannot be in the source code directly. One possibility to handle that is using
if has_ucs4:
    NON_PRINTABLE = u'[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD\U00010000-\U0010ffff]'
elif sys.platform.startswith('java'):
    # Jython doesn't support lone surrogates https://bugs.jython.org/issue2048
    NON_PRINTABLE = u'[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]'
else:
    # Need to use eval here due to the above Jython issue
    NON_PRINTABLE = eval(r"u'[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uFFFD]|(?:^|[^\uD800-\uDBFF])[\uDC00-\uDFFF]|[\uD800-\uDBFF](?:[^\uDC00-\uDFFF]|$)'")
NON_PRINTABLE = re.compile(NON_PRINTABLE)
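To make the first branch of the snippet above concrete, here is a minimal sketch of how the wide-build (`has_ucs4`) pattern behaves. It assumes Python 3, where every `str` is indexed by code point; the helper name `first_non_printable` is hypothetical and is not part of PyYAML.

```python
import re

# Same character class as the has_ucs4 branch above, written for Python 3.
# The literal itself contains no lone surrogates, only range endpoints
# on either side of the surrogate block (U+D7FF and U+E000).
NON_PRINTABLE = re.compile(
    '[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]'
)


def first_non_printable(text):
    """Return the index of the first rejected character, or None.

    Hypothetical helper for illustration only.
    """
    match = NON_PRINTABLE.search(text)
    return match.start() if match else None


# Ordinary text and astral characters are accepted ...
print(first_non_printable("plain text\n"))
print(first_non_printable("ok \U0001F600"))
# ... while control characters and lone surrogates are rejected.
print(first_non_printable("a\x00b"))
print(first_non_printable("x\ud800"))
```

On a wide build the surrogate block can simply be left out of the class, which is why only the narrow-build pattern needs to mention lone surrogates explicitly.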
@pekkaklarck how about moving the regex to an extra file and importing it depending on OTOH I think there is an existing PR #124 rewriting this check without regex. Maybe that could work, but it would have to be updated. |
Having regexps in their own modules that are imported as needed would work too. I've used that trick in our project to hide differences between Python 2, Python 3, PyPy, Jython and IronPython when there has been more code. If there has been a problem in just one line, I've typically used Not needing that |
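For illustration, a regex-free version of the printable check could look roughly like the sketch below. This is not the code from PR #124, which is not shown in this thread; the function names are hypothetical, and the sketch assumes strings are indexed by code point (Python 3 or a wide Python 2 build). On a narrow build or on Jython, characters above U+FFFF appear as surrogate pairs and would need extra handling.

```python
def is_printable(ch):
    """Check one character against the ranges used in the patterns above.

    Hypothetical helper; assumes `ch` is a single code point.
    """
    cp = ord(ch)
    return (
        cp in (0x09, 0x0A, 0x0D, 0x85)
        or 0x20 <= cp <= 0x7E
        or 0xA0 <= cp <= 0xD7FF        # stops just before the surrogate block
        or 0xE000 <= cp <= 0xFFFD      # resumes just after it
        or 0x10000 <= cp <= 0x10FFFF
    )


def find_non_printable(text):
    """Return the index of the first non-printable character, or None."""
    for index, ch in enumerate(text):
        if not is_printable(ch):
            return index
    return None
```

Because the ranges are plain integers, no surrogate ever appears in a string literal, so the source file itself would not trip over the Jython limitation.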
I think in this case I would actually be ok with the |
I can do that but won't have time in the near future. I wouldn't mind you or someone else taking care of this. It would be a good first issue for someone new who is interested in contributing to open source. |
I think |
This patch was taken from #369 (comment), authored by Pekka Klärck <[email protected]>.

In short, Jython doesn't support lone surrogates, so importing yaml (and in particular, loading `reader.py`) caused a UnicodeDecodeError. This patch works around this through a clever use of `eval` to defer evaluation of the string containing the lone surrogates, only doing it on non-Jython platforms.

This is only done in `lib/yaml/reader.py` and not `lib3/yaml/reader.py` because Jython does not support Python 3.

With this patch, Jython's behavior with respect to Unicode code points over 0xFFFF becomes as it was before 0716ae2. It still does not pass all the unit tests on Jython (passes 1275, fails 3, errors on 1); all the failing tests are related to unicode. Still, this is better than simply crashing upon `import yaml`.

With this patch, all tests continue to pass on Python 2 / Python 3.
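The deferral trick described above can be sketched in isolation. The key point is that a raw string keeps `\uD800` as six plain ASCII characters, so the parser never has to build a surrogate while compiling the module; the surrogates only come into existence if `eval` actually runs. The names `PATTERN_SOURCE` and `build_pattern` are hypothetical, and the pattern is a shortened stand-in for the real one.

```python
import sys

# Raw string: the backslash escapes are NOT interpreted here, so this
# literal contains only ASCII characters.
PATTERN_SOURCE = r"u'[\uD800-\uDBFF]'"


def build_pattern():
    """Evaluate the deferred literal, except on Jython.

    Hypothetical helper mirroring the platform check in the patch.
    """
    if sys.platform.startswith('java'):
        # Jython: never evaluate the literal containing lone surrogates.
        return None
    return eval(PATTERN_SOURCE)


pattern = build_pattern()
```

On CPython, `pattern` is a five-character string whose second and fourth characters are the surrogates U+D800 and U+DBFF; on Jython the function returns before `eval` is reached.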
Fixed by #378 in 5.4 |
PyYAML 5.2 still worked but 5.3 crashes at import. Tested with Jython 2.7.0 and 2.7.2b2 and this is the result:
This seems to be caused by PR #351. Apparently the root cause is that Jython doesn't support lone surrogates at all. According to https://bugs.jython.org/issue2048 that is by design.
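For readers unfamiliar with the term, the sketch below shows what a lone surrogate is, using CPython 3 as an assumption (this is not Jython output). A character above U+FFFF is represented in UTF-16 as a pair of surrogate code units; a surrogate that appears without its partner is "lone" and is not a valid character on its own.

```python
# A character above U+FFFF becomes a high/low surrogate pair in UTF-16.
pair = "\U0001F600".encode("utf-16-be")
print(pair.hex())  # d83dde00: high surrogate D83D, low surrogate DE00

# CPython 3 lets a lone surrogate exist inside a str ...
lone = "\ud800"

# ... but refuses to encode it with the default strict error handler.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The 'surrogatepass' handler forces the encoding through anyway.
forced = lone.encode("utf-8", "surrogatepass")
```

CPython tolerates such values in memory, whereas, per the Jython issue linked above, Jython rejects them by design, which is why a literal containing them fails at import time.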