Parser combinators #53
The mental model we have been using so far is to parse JSON until the parser encounters unknown data, at which point we switch to customized parsing. Instead we could check for the custom data format before parsing as JSON. For instance, we could add something like this:
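(A rough sketch of that check; `try_parse_date` and `date` are hypothetical names, and the scanning is hand-rolled rather than an existing Trial.Protocol facility.)

```cpp
#include <optional>
#include <string>
#include <string_view>

// Hypothetical result type for the custom date mini-parser.
struct date { int year, month, day; };

// Probe for the custom %YYYY-MM-DD% format before falling back to JSON.
std::optional<date> try_parse_date(std::string_view& input)
{
    // "%2021-07-07%" is 12 characters long.
    if (input.size() < 12 || input.front() != '%' || input[11] != '%')
        return std::nullopt;
    // Naive field extraction; real code would validate the digits.
    date result;
    result.year = std::stoi(std::string(input.substr(1, 4)));
    result.month = std::stoi(std::string(input.substr(6, 2)));
    result.day = std::stoi(std::string(input.substr(9, 2)));
    input.remove_prefix(12); // advance past the matched token
    return result;
}

void parse_value(std::string_view& input)
{
    if (auto d = try_parse_date(input)) {
        // handle *d as a date value
    } else {
        // hand the input to the JSON parser as usual
    }
}
```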
This idea reminds me of the ambiguity problem. re2c solves ambiguity with the following rule: when multiple rules match, the longest match takes precedence; if two rules match a string of the same length, the rule defined first wins (e.g. with the rules `"a"` and `[a-z]+`, the input `abc` always goes to `[a-z]+` because it matches more characters, while the input `a` goes to whichever of the two rules is defined first).
I don't even think a helper algorithm is necessary. The user could just try the mini-parser before the JSON parser.

The chosen approach changes how the composed parser resolves ambiguity. If multiple parsers match the same substream, which one should take precedence? That choice can be fixed, or it can be left to the user. If no API is added, then the user keeps the choice of trying the mini-parser first, as you've pointed out; that's something he can already do. The mini-parser will then always have precedence over the builtin rules, and if the JSON parser is later changed in a way that makes […], the mini-parser will still take precedence.

However, I'm only looking at the conceptual solution here. How do you think performance would be affected, for instance? That's another perspective, and I think you're in a better position to make a trustworthy judgement there.
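One way to leave that choice to the user is to make precedence follow composition order. A rough sketch (the `parser` signature and the `any_of` combinator are illustrative assumptions, not part of Trial.Protocol):

```cpp
#include <functional>
#include <string_view>
#include <utility>
#include <vector>

// Assumed parser shape for this sketch: consume a prefix of the input
// and return true on success; leave the input untouched on failure.
using parser = std::function<bool(std::string_view&)>;

// Ordered choice: earlier alternatives take precedence, so the user
// resolves ambiguity simply by choosing the order of composition.
parser any_of(std::vector<parser> alternatives)
{
    return [alternatives = std::move(alternatives)](std::string_view& input) {
        for (const auto& p : alternatives)
            if (p(input))
                return true;
        return false;
    };
}

// Usage: the mini-parser is tried first, so it wins over builtin rules.
// auto value = any_of({date_parser, json_parser});
```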
I definitely believe that the user should be able to resolve ambiguity. This may require domain-specific knowledge which we do not have. The mini-parser idea is actually how […]. The biggest challenge with making […] is that it requires the user to create his own copy.
I've lost you here. Can you elaborate? What do you mean by the user creating his own copy? What scenario of combining parsers would this be, exactly?
Turning […]. So far the only show-stopper for me is the inability to distinguish between […].
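For what it's worth, the copy idea could look roughly like this (every name here is a hypothetical stand-in; it assumes a pull parser that is cheap to copy):

```cpp
#include <string_view>

// Assumed: a copyable pull parser over a view of the input.
struct reader
{
    std::string_view remaining;
    // ... token state ...
};

// Hypothetical mini-parser: consumes its token on success.
bool parse_date(reader& r);

// Try the mini-parser on a private copy and commit only on success,
// so a failed attempt leaves the original reader untouched.
bool try_mini_parser(reader& r)
{
    reader speculative = r; // the user's own copy
    if (!parse_date(speculative))
        return false;       // original reader unchanged
    r = speculative;        // commit the consumed input
    return true;
}
```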
The design here is certainly tricky.
So, I've been keeping the topic of parser combinators in the back of my mind since ~2016, when you introduced me to the idea.
TBH I had some resistance to the approach of parser combinators because they seemed too difficult to implement and of limited usefulness/reuse (or so I thought). However, recently I was playing with re2c on a new project, and there I had the chance to declare several mini-parsers and jump from one to another while they reused data pretty seamlessly. The realization struck me immediately: I was using parser combinators.
Usually the talks around this topic focus on functional languages, and the approaches I've seen are pretty complex (and limited too, TBH). However, with re2c I had the chance to combine parsers under an imperative approach, and the result was pretty pleasant and much more natural in C++.
The experiment also made me realize that all this time I had been looking at parser combinators from the wrong angle. Usually parser combinators are presented with a focus on reusing matchers for mini-parsers plus a few very simple combinators/glues. That has weak appeal, given that the rules for mini-parsers (i.e. a few very simple regexes that could easily be implemented by hand) (1) offer very little reuse and (2) only offer reuse for the easy logic. When I'm interested in reusing a parser, I want to reuse the logic for the hard stuff (e.g. state/nesting-level tracking in JSON parsers, or the complex algorithm to determine the body length in HTTP). The hard logic is not in the matchers, but in the code for everything else. It's the logic from this big self-contained parser that I want to reuse, not a few matchers that I could have written myself.
That got me thinking: what is it that will enable me to reuse the logic from the big parser? What is re2c doing that enables me to combine mini-parsers so seamlessly? The answer was right in front of me: combining mini-parsers the re2c way allows one to defer the handling of some token straight to another mini-parser. Let's begin with a JSON example:
Suppose you want a syntactic extension that allows one to encode dates in JSON documents:
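For instance, a document like this (illustrative; the `%...%` token is the proposed extension):

```
{
    "name": "example",
    "released": %2021-07-07%
}
```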
What you want here is: when the token `%2021-07-07%` is reached, a different parser is invoked to handle it, and then we advance the main parser as well. So, really, there are two things to be seen by the main parser: (1) a different parser handles the foreign token, and (2) the main parser is advanced past it.

Trial.Protocol already offers #1. #2 can be emulated: for instance, before the parser for our syntactic date extension returns, it could call `reader.next("null")`. However, I contend it's desirable to offer an interface that consumes the already-decoded token/value to skip some unnecessary logic. For instance, this mini date parser could call `reader.next<token::null>()` instead. Could we have this addition? I don't think it's a controversial addition to the interface.

That covers the main part of the problem. And then, for the last part, we have the matchers again: what to do on `token::code::error_unexpected_token`? If we're using a chunked parser, then this should be a non-fatal error, as it would allow a fallback mini-parser to be invoked to handle just that token. This change would allow one to reuse the Trial.Protocol parser to parse JSON documents with comments, for instance. However, this point about a non-fatal unexpected-token error is controversial and deserves much more thought than the previous points. In the meantime, could `chunk_reader.next<token::null>()` be added?
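For illustration, the emulation described above might look roughly like this. The declarations are stand-ins that follow the names used in this discussion, not the verified Trial.Protocol API:

```cpp
#include <string_view>

// Stand-ins for the interface this sketch assumes.
namespace token { enum class code { end, error_unexpected_token, value }; }

struct chunk_reader
{
    token::code code() const;
    bool next();                       // advance to the next token
    bool next(std::string_view chunk); // feed input, then advance
};

bool parse_date(std::string_view& input); // the %YYYY-MM-DD% mini-parser
void handle(const chunk_reader&);         // regular token handling

void parse_extended_json(chunk_reader& reader, std::string_view& input)
{
    while (reader.code() != token::code::end)
    {
        if (reader.code() == token::code::error_unexpected_token)
        {
            // Non-fatal unexpected token: let the fallback mini-parser
            // consume %2021-07-07%, then advance the main parser by
            // feeding it an equivalent builtin value.
            parse_date(input);
            reader.next("null");
            // With the proposed addition the re-decoding could be
            // skipped: reader.next<token::null>();
            continue;
        }
        handle(reader); // ordinary JSON token
        reader.next();
    }
}
```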