Parsing of trailing TAB works differently for Python and C #594

Open
asomov opened this issue Dec 19, 2021 · 6 comments

@asomov
Contributor

asomov commented Dec 19, 2021

This works properly (note the trailing TAB):

>>> from yaml import load, CLoader as Loader, CDumper as Dumper
>>> data = load('"bar"\t', Loader=Loader)

This fails:

>>> from yaml import load, Loader, Dumper
>>> data = load('"bar"\t', Loader=Loader)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/yaml/__init__.py", line 114, in load
    return loader.get_single_data()
  File "/usr/lib/python3/dist-packages/yaml/constructor.py", line 49, in get_single_data
    node = self.get_single_node()
  File "/usr/lib/python3/dist-packages/yaml/composer.py", line 35, in get_single_node
    if not self.check_event(StreamEndEvent):
  File "/usr/lib/python3/dist-packages/yaml/parser.py", line 98, in check_event
    self.current_event = self.state()
  File "/usr/lib/python3/dist-packages/yaml/parser.py", line 142, in parse_implicit_document_start
    if not self.check_token(DirectiveToken, DocumentStartToken,
  File "/usr/lib/python3/dist-packages/yaml/scanner.py", line 116, in check_token
    self.fetch_more_tokens()
  File "/usr/lib/python3/dist-packages/yaml/scanner.py", line 258, in fetch_more_tokens
    raise ScannerError("while scanning for the next token", None,
yaml.scanner.ScannerError: while scanning for the next token
found character '\t' that cannot start any token
  in "<unicode string>", line 1, column 6:
    "bar"	
         ^

@asomov changed the title from "Parsing of traling TAB works differently for Python and C" to "Parsing of trailing TAB works differently for Python and C" on Dec 19, 2021
@ingydotnet
Member

The short answer to your query is that in this case libyaml is right and pyyaml is wrong.

https://play.yaml.io/main/parser?input=ImJhciIJ shows the results of 14 YAML parsers, and PyYAML, Ruamel (fork of PyYAML) and SnakeYAML get this one wrong.
The New Reference Parser there is literally generated from the spec productions and therefore is almost always correct in its interpretation.
That might be a useful resource for you.
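For readers who want to confirm that the playground link carries the same document as the report, here is a minimal sketch that decodes its `input` query parameter. It assumes the parameter is base64-encoded, which is inferred only from this one value decoding cleanly; the helper name `decode_playground_input` is hypothetical and not part of any playground API.

```python
import base64
from urllib.parse import urlparse, parse_qs


def decode_playground_input(url: str) -> str:
    """Return the YAML text carried in the `input` query parameter.

    Assumption: the value is base64 (standard or URL-safe alphabet).
    """
    query = parse_qs(urlparse(url).query)
    encoded = query["input"][0]
    # Restore any '=' padding that may have been stripped from the URL.
    padded = encoded + "=" * (-len(encoded) % 4)
    return base64.urlsafe_b64decode(padded).decode("utf-8")


url = "https://play.yaml.io/main/parser?input=ImJhciIJ"
print(repr(decode_playground_input(url)))  # '"bar"\t'
```

The decoded text is the quoted scalar followed by a single TAB, the same input that is passed to `load` in the report above.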

The productions involved are:

Which is spaces and tabs.
Put another way, non-indentation whitespace is usually tabs and spaces.

@ingydotnet
Member

Also re https://sourceforge.net/p/yaml/mailman/yaml-core/thread/CAHJtQJ4YE19fZS%2B7fGJ11P17w6P%2BPi27GcLXtdSv6L5uxeAofA%40mail.gmail.com/#msg37404600

In which you show the libyaml test suite not working, I was able to run this:

★ ~ $ git clone git@github.com:yaml/libyaml && (cd libyaml && ./bootstrap && ./configure && make test-suite)
...
ok 214 ZWK4: Key with anchor after missing explicit mapping value
1..214
ok
All tests successful.
Files=3, Tests=452,  7 wallclock secs ( 0.14 usr  0.00 sys +  6.22 cusr  2.08 csys =  8.44 CPU)
Result: PASS
make[1]: Leaving directory '/home/ingy/libyaml/tests/run-test-suite'

Hope that helps.

Note: I'll still be looking into improving the state of libyaml's testing.

@asomov
Contributor Author

asomov commented Dec 21, 2021

It is not about right or wrong. The point is that the very same parser either succeeds or fails for the same YAML document.
It means that an import in Python changes not only the performance but also, significantly, the functionality.

@ingydotnet
Member

ingydotnet commented Dec 21, 2021

Ah but they are not the very same parser. They are the 2 distinctly different parsers that PyYAML contains. A pure Python one and libyaml. Note that libyaml was originally a direct port from PyYAML, written by the same person.

There are several known places where PyYAML using pure Python and PyYAML using libyaml differ. These are of course bugs, either in the Python code or libyaml.
For the test case you posted, libyaml parses according to the spec, and PyYAML's pure Python parser has a bug.
That's why I said libyaml is right and PyYAML (the Python code) is wrong.

Note: It was my understanding that you were trying to find out how to interpret the spec, so that you could implement your SnakeYAML Java YAML parser correctly.
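To see where the two parsers diverge on a given input, the same text can be run through both loaders and the outcomes compared. This is only a sketch: `compare_loaders` is a hypothetical helper, it assumes PyYAML is installed, and `CLoader` is included only when the libyaml bindings are available in that installation. Results may differ between PyYAML versions, so no output is shown.

```python
import yaml


def compare_loaders(text):
    """Parse `text` with each available loader and record the outcome."""
    loaders = {"Loader": yaml.Loader}
    # CLoader exists only when PyYAML was built with the libyaml bindings.
    c_loader = getattr(yaml, "CLoader", None)
    if c_loader is not None:
        loaders["CLoader"] = c_loader

    results = {}
    for name, loader_cls in loaders.items():
        try:
            results[name] = ("ok", yaml.load(text, Loader=loader_cls))
        except yaml.YAMLError as exc:
            # ScannerError and the other parse errors derive from YAMLError.
            results[name] = ("error", type(exc).__name__)
    return results


# The document from this issue: a quoted scalar followed by a trailing TAB.
for name, outcome in compare_loaders('"bar"\t').items():
    print(name, outcome)
```

Any input where the two entries disagree is a candidate for one of the known differences between the pure Python code and libyaml.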

@earonesty

other repos have this bug because of pyyaml: docker/compose#5662

@asomov
Contributor Author

asomov commented Jan 17, 2023

@ingydotnet I think this use case is not covered by the test suite.
DE56 contains a lot of trailing TABs, but not this one.
