Introduce support for optimistic parsing #36
Conversation
This commit adds "optimistic" tokenizers and parsers to the `plainjs` codebase. Optimistic tokenizers will emit "incomplete" tokens when they think that they will eventually reach a state where a token can be definitively produced. Optimistic parsers respond to incomplete tokens and other states by presenting a view of the parsed value that is expected to be in some sense a "subset" of the value that will eventually be parsed (assuming, optimistically, that the stream will constitute a valid value when all is said and done).
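To illustrate the general idea (this is a toy sketch of delimiter-completion on a JSON prefix, not the tokenizer/parser implementation in this PR), one way to produce such a "subset" view is to close whatever delimiters are still open and attempt a normal parse:

```typescript
// Toy illustration only: given a JSON prefix, append the closing
// delimiters that are still open and try JSON.parse, yielding a value
// that is (optimistically) a "subset" of the eventual parsed value.
function optimisticParse(prefix: string): unknown {
  const closers: string[] = [];
  let inString = false;
  let escaped = false;
  for (const ch of prefix) {
    if (escaped) { escaped = false; continue; }
    if (inString) {
      if (ch === "\\") escaped = true;
      else if (ch === '"') { inString = false; closers.pop(); }
      continue;
    }
    if (ch === '"') { inString = true; closers.push('"'); }
    else if (ch === "{") closers.push("}");
    else if (ch === "[") closers.push("]");
    else if (ch === "}" || ch === "]") closers.pop();
  }
  const completed = prefix + closers.reverse().join("");
  try {
    return JSON.parse(completed);
  } catch {
    return undefined; // prefix not repairable yet, e.g. it ends mid-key: {"na
  }
}

console.log(optimisticParse('["First", "Sec'));   // → [ 'First', 'Sec' ]
console.log(optimisticParse('{ "content": "Hi')); // → { content: 'Hi' }
```

A real optimistic tokenizer works incrementally rather than rescanning the whole buffer, but the observable behaviour is similar in spirit.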
Thanks for the contribution @lunaris
Awesome work, @lunaris! The switch block at the end of the […] For the […] I'm sure that there will not be any performance impact on the […] I'll try to work on some code and share it here for us to discuss. Wdyt?
```typescript
case OptimisticTokenizerStates.NUMBER_AFTER_E_AND_DIGIT:
  this.onToken({
    token: TokenType.NUMBER,
    value: this.parseNumber(this.bufferedNumber.toString()),
```
This is the only place where this could fail. If the number is complex, e.g. `-2`, `2.3`, `2e10`, etc., the parsing can fail. I think that in that case we need to decide whether to not emit, emit with an `undefined` value, or emit with a string value. Wdyt?
If there's a chance it can fail, 100% agree. That said, is there that chance? In my head:

- `NUMBER_AFTER_INITIAL_ZERO` -- the only buffer we can have here is `"-0"`, I think, which will parse as `0`
- `NUMBER_AFTER_INITIAL_NON_ZERO` -- I think here we have buffers like `-1`, `2`, which again will parse
- `NUMBER_AFTER_DECIMAL` -- `-3.1`, `2.2`, which will parse
- `NUMBER_AFTER_E_AND_DIGIT` -- I think things like `2e4`, which will parse

Wholly possible I'm missing cases that would fail though and it's cheap to add the guard either way. Let me know!
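For what it's worth, a quick sanity check (a sketch; this assumes behaviour comparable to JavaScript's `Number` -- the library's `parseNumber` may differ) that the buffer shapes listed above all parse, while genuinely partial buffers do not:

```typescript
// Every buffer shape reachable in the states listed above is a complete,
// valid number literal, so Number() handles each of them.
const completeBuffers = ["-0", "-1", "2", "-3.1", "2.2", "2e4"];
for (const b of completeBuffers) {
  if (Number.isNaN(Number(b))) throw new Error(`unexpectedly failed to parse ${b}`);
}

// By contrast, genuinely partial buffers (unreachable in those states) fail:
console.log(Number.isNaN(Number("2e"))); // true -- exponent digit missing
console.log(Number.isNaN(Number("-")));  // true -- sign only
```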
All sounds good to me, though I think if we have to look at the approach of doing it in […]. Thanks for the review and the response!
Closing as this was already released.
Problem
Incremental parsing is super useful in the face of streamed data. A growing use case for streamed data that needs to be parsed is that returned by AIs/chatbots such as OpenAI's GPT. The GPT message payload `{ "content": "Hi, how are you doing today?" }` might be sent as the following chunks (indented to try and improve readability):

If using a GPT function call, in which one can ask GPT to indicate when it would like to call some specified function with a set of arguments, GPT may respond with a (streamed) JSON payload that additionally contains stringified JSON. The payload `{ "name": "getWeather", "arguments": "{ \"city\": \"London\" }" }` might be chunked as follows:

As it stands, `@streamparser/json` allows one to feed a parser with chunks as they come, but it doesn't allow one access to any intermediate state. This is discussed a bit in #31 and the rationale makes sense -- such intermediate state could involve "guessing" what might come next, or making assumptions about whether or not a full parse will eventually succeed. #31 provides a workaround, but it does not work (as far as I can tell) for arbitrarily-nested JSON substructures. Ideally, it'd be possible at any point in the parsing process to have a view of the JSON that "might eventually be produced", e.g. for updating a front-end rendering of the response in real time. For instance, in the case of a function call such as `{ "name": "offerSuggestions", "arguments": "[\"First\", \"Second\"]" }`, it would be useful to have a view that at various points (optimistically) provided the arrays `[]`, `["Fir"]`, `["First"]`, `["First", "Sec"]` and `["First", "Second"]` (for instance, depending on how the chunks fall).

Proposal
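To make that desired view sequence concrete before describing the design, here is a toy sketch (chunk boundaries and the `completeArray` helper are invented for illustration; this is not the implementation proposed here) of the views a consumer might see after each chunk of the `arguments` string:

```typescript
// Toy model: repair a growing buffer of an array-of-strings payload by
// closing an unterminated string and the array bracket, then JSON.parse.
function completeArray(prefix: string): unknown {
  try { return JSON.parse(prefix); } catch { /* incomplete -- repair below */ }
  const quotes = (prefix.match(/"/g) ?? []).length;
  const repaired = prefix + (quotes % 2 === 1 ? '"' : "") + "]";
  try { return JSON.parse(repaired); } catch { return undefined; }
}

// Invented chunk boundaries for "[\"First\", \"Second\"]":
const chunks = ["[", '"Fir', 'st"', ', "Sec', 'ond"]'];
let buffer = "";
const views: unknown[] = [];
for (const chunk of chunks) {
  buffer += chunk;
  views.push(completeArray(buffer));
}
console.log(views);
// → [ [], [ 'Fir' ], [ 'First' ], [ 'First', 'Sec' ], [ 'First', 'Second' ] ]
```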
This PR introduces "optimistic" variants of the existing tokenizer, token parser and combination JSON parser. The design is as follows:

- […] (e.g. via `write`), it may emit an additional "incomplete" token. Some examples: […]
- […] `onValue`-style approach, instead offering the "top-level" value being parsed at all points (though this value will be mutated over the course of a parse). This allows consumers to write a chunk synchronously (as they do today) and observe the optimistic parse results directly.

Details and comments
I appreciate this PR may offer functionality that is deemed out of scope or not in line with the implementations currently offered by this library -- if there are changes or improvements that would alter this, please let me know! If not, thank you very much for the lovely library and providing such a great base to build these features (which we are already using) on!
- […] `README` to give a sense of the optimistic parsing support, but it might not be enough/might warrant its own piece of documentation.
- […] `paths` and `keepStack` options/don't emit nested values. If this is useful/required these features could likely be added back in. For a first implementation I removed them to make it simpler to work out how optimistic parsing could work.

Thanks again for reading this and the library as it stands today!