
Parsers combinators #53

Open · vinipsmaker opened this issue Jul 8, 2021 · 7 comments

vinipsmaker commented Jul 8, 2021

So, I've been keeping the topic of parser combinators in the back of my head since ~2016, when you introduced me to the idea.

TBH I had some resistance to the approach of parser combinators because they seemed too difficult to implement and of limited usefulness/reuse (or so I thought). However, I was recently playing with re2c on a new project, and there I had the chance to declare several mini-parsers and jump from one to another while they reused data pretty seamlessly. The idea immediately struck me: I was using parser combinators.

Usually the talks around the topic focus on functional languages, and the approaches I've seen are pretty complex (and limited too, TBH). With re2c, however, I had the chance to combine parsers under an imperative approach, and the result was pretty pleasant and much more natural in C++.

The experiment also made me realize that all this time I had been approaching parser combinators from the wrong angle. Parser combinators are usually presented with a focus on reusing matchers for mini-parsers plus a few very simple combinators/glues. That has weak appeal, given that the rules for mini-parsers (i.e. a few very simple regexes that could easily be implemented by hand) (1) offer very little reuse and (2) only offer reuse for the easy logic. When I'm interested in reusing a parser, I want to reuse the logic for the hard stuff (e.g. state/nesting-level tracking in JSON parsers, or the complex algorithm to determine the body length in HTTP). The hard logic is not in the matchers, but in the code for everything else. It's the logic of this big self-contained parser that I want to reuse, not a few matchers that I could have written myself.

That got me thinking: what would enable me to reuse the logic of the big parser? What is re2c doing that lets me combine mini-parsers so seamlessly? The answer was right in front of me: combining mini-parsers the re2c way allows one to defer the handling of some token straight to another mini-parser. Let's begin with a JSON example:

{ "hello": "world" }

Suppose you want a syntactic extension that allows one to encode dates in JSON documents:

{ "hello": %2021-07-07% }

What you want here is: when the token %2021-07-07% is reached, a different parser is invoked to handle it, and then the main parser is advanced as well. So, really, the main parser sees two things:

  1. When it reaches a certain point, a hole appears. IOW, it's a chunked parser.
  2. When the hole is over, the state of the parser should be advanced as if it had consumed a value token.

Trial.Protocol already offers #1. #2 can be emulated: for instance, before the parser for our syntactic date extension returns, it could call reader.next("null"). However, I contend it's desirable to offer an interface that consumes an already-decoded token/value and skips some unnecessary logic. For instance, this mini date parser could call reader.next<token::null>() instead. Could we have this addition? I don't think it's a controversial change to the interface.
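
To make the flow concrete, here's a minimal sketch under a few assumptions: parse_date_extension() is a made-up mini-parser, constructing chunk_reader from a view and the accessors on the tail view are guesses, and only next(view)/next("null") come from the discussion above:

// Sketch only. parse_date_extension() is hypothetical; it handles the
// custom token and returns the input that follows it.
json::chunk_reader reader(input);
while (reader.next()) {
    // ... handle ordinary JSON tokens ...
}
if (reader.code() == json::token::code::error_unexpected_token
    && !reader.tail().empty() && reader.tail().front() == '%')
{
    auto rest = parse_date_extension(reader.tail());
    reader.next("null");  // emulate consumption of a value token
    reader.next(rest);    // resume the main parser after the hole
}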

That covers the main part of the problem. Then, for the last part, we have the matchers again: what to do on token::code::error_unexpected_token? If we're using a chunked parser, this should be a non-fatal error, as it would allow a fallback mini-parser to be invoked to handle just that token. This change would allow one to reuse the Trial.Protocol parser to parse JSON documents with comments, for instance. However, this point about a non-fatal unexpected-token is controversial and deserves much more thought than the previous points. In the meantime, could chunk_reader.next<token::null>() be added?


breese commented Jul 12, 2021

chunk_reader has a function that may help: next(const view_type&) instructs the parser to continue from a new view while preserving some internal state (e.g. nesting).
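
A minimal usage sketch (assuming chunk_reader can be constructed from the first chunk):

json::chunk_reader reader(first_chunk);
while (reader.next()) {
    // ... consume tokens until the chunk runs out ...
}
reader.next(second_chunk);  // continue from a new view; nesting is preserved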

vinipsmaker commented

next<class Token>() could have some of its checks converted from ec = token::code::error_* to assert(state != ...), and might thus be cheaper and generate smaller code (assuming Clang isn't already smart enough to "execute" much of that code when constants are used... which it usually is). It's not urgent, though. As you've said, next(const view_type&) can already be used. This issue is more a place to discuss ideas.


breese commented Jul 15, 2021

Maybe next(const view_type&) should be renamed to resume(view_type) to make its purpose clearer. We could also add this function to the normal reader.


breese commented Jul 15, 2021

The mental model we have been using so far is to parse JSON until the parser encounters unknown data, at which point we switch to customized parsing.

Instead, we could check for the custom data format before parsing as JSON. For instance, we could add a skip_until() algorithm that takes a predicate and keeps skipping until the predicate is true. The predicate takes the tail view so it can detect custom data.

Something like this:

auto after = skip_until(reader, [] (view_type input) { return !input.empty() && input.front() == '%'; });
// Do custom parsing now
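
One possible shape for such an algorithm, sketched under the assumption that tail() returns the unparsed view and next() advances one token; neither the name nor the signature is part of the library yet:

template <typename Reader, typename Predicate>
auto skip_until(Reader& reader, Predicate predicate) -> typename Reader::view_type
{
    // Advance one token at a time until the predicate fires on the
    // remaining input, or until the input is exhausted.
    while (!predicate(reader.tail()) && reader.next())
        ;
    return reader.tail();
}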

vinipsmaker commented

This idea reminds me of the ambiguity problem. re2c resolves ambiguity with the following rule:

If multiple rules match the same string, the earlier rule takes precedence.


we could add a skip_until()

I don't even think a helper algorithm is necessary. The user could just try the mini-parser before the JSON parser, and use the mini_parser.literal().size() value to know how much input in json_reader.tail() should be skipped. That is more or less already the solution for combining multiple JSON parsers. For instance, I've mentioned using another chunked_parser to implement partial::{skip,parse}() in another message.
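
Roughly like this sketch, where date_parser, good(), value() and handle_date() are made up, substr() on the tail view is assumed, and the value-token advancement discussed earlier is glossed over:

date_parser mini(json_reader.tail());  // try the mini-parser first
if (mini.good())
{
    handle_date(mini.value());
    // Skip past the custom token and resume the JSON parse.
    json_reader.next(json_reader.tail().substr(mini.literal().size()));
}
else
{
    json_reader.next();  // fall back to the builtin JSON rules
}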

The chosen approach changes how the composed parser resolves ambiguity. If multiple parsers match the same substream, which one should take precedence? That choice can be fixed, or it can be left to the user.

If no API is added, then the user has the choice to try the mini-parser first, as you've pointed out. That's something he can already do. The mini-parser will then always take precedence over the builtin rules.

If the JSON parser is changed so that error_unexpected_token becomes a non-terminal state, then the user has two choices that only affect precedence (either the previous solution, or trying the JSON parser first). Honestly, I don't think it's really important to give the user a choice here. I can't imagine how ambiguous rules could be a good idea for extending the JSON syntax.

However, I'm only looking at the conceptual solution here. How do you think performance would be affected, for instance? That's another perspective, and I think you're in a better position to judge it.


breese commented Jul 15, 2021

I definitely believe that the user should be able to resolve ambiguity. This may require domain-specific knowledge which we do not have.

The mini-parser idea is actually how chunk_reader works. It copies its internal state so it can restore the state if parsing fails due to incomplete input. It also knows that it does not need a full copy of the nesting levels, as a parsing failure can only affect the top-most element. On the other hand, it does create a copy of the state on each next() call. In contrast, if the user creates his own copy (mini-parser), then he can continue parsing on that copy until it fails, so overall fewer copies are made. Keep in mind that sizeof(json::reader) is 328 bytes (on an ILP64 architecture), most of which is internally cached pointers to optimize parser performance.
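
Roughly like this sketch, assuming json::reader is copyable and that is_failure() stands in for whatever check distinguishes a failed parse:

json::reader speculative = reader;  // the user-made copy, created once
while (speculative.next())
{
    // ... consume tokens on the copy ...
}
if (!is_failure(speculative.code()))
    reader = speculative;  // commit the advanced state
// otherwise discard the copy and let a fallback parser take over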

The biggest challenge with making error_unexpected_token a non-terminal state is that it is returned in different contexts, some of which are clearly JSON parsing failures, such as a missing JSON Object value (e.g. {"key":}). There is no show-stopper here, but it does take some work to figure out which is which.

vinipsmaker commented

In contrast, if the user creates his own copy (mini-parser), then he can continue parsing on that copy until it fails, so overall fewer copies are made.

I've lost you here. Can you elaborate? What do you mean by the user creating his own copy? What scenario of combining parsers would this be, exactly?

The biggest challenge with making error_unexpected_token a non-terminal state [...] There is no show-stopper here, but it does take some work to figure out which is which.

Turning error_unexpected_token into a non-terminal state is useful in a limited scenario (you want parser combinators, and you want the main JSON parser to take precedence over your mini-parsers), but I haven't hit that scenario yet.

So far the only show-stopper for me is the inability to distinguish between error_unexpected_token and "need more input". I can't implement support for concatenated JSON streams in tabjson without this (otherwise, at best, I'd be buffering more of the stream endlessly). I think I can already work around everything else.

The biggest challenge with making error_unexpected_token a non-terminal state is that it is returned in different contexts, some of which are clearly JSON parsing failures, such as a missing JSON Object value (e.g. {"key":}). There is no show-stopper here, but it does take some work to figure out which is which.

The design here is certainly tricky.
