
Introduce support for optimistic parsing #36

Closed
wants to merge 1 commit

Conversation

lunaris

@lunaris lunaris commented Sep 15, 2023

Problem

Incremental parsing is super useful in the face of streamed data. A growing use case for streamed data that needs to be parsed is the output of AIs/chatbots such as OpenAI's GPT. The GPT message payload { "content": "Hi, how are you doing today?" } might be sent as the following chunks (indented to improve readability):

{ "con
      ten
         t": "
              Hi, h
                   ow are
                          you doi
                                 ng to
                                      day?
                                          " }

If using a GPT function call, in which one can ask GPT to indicate when it would like to call some specified function with a set of arguments, GPT may respond with a (streamed) JSON payload that additionally contains stringified JSON. The payload { "name": "getWeather", "arguments": "{ \"city\": \"London\" }" } might be chunked as follows:

{ "nam
      e" : "g
             etWeat
                   her", "ar
                            gume
                                nts": "{
                                         \"
                                           city\":
                                                   \"Lon
                                                        don\" }"
                                                                 }

As it stands, @streamparser/json allows one to feed a parser with chunks as they come, but it doesn't allow access to any intermediate state. This is discussed a bit in #31, and the rationale makes sense -- such intermediate state could involve "guessing" what might come next, or making assumptions about whether or not a full parse will eventually succeed. #31 provides a workaround, but (as far as I can tell) it does not work for arbitrarily-nested JSON substructures. Ideally, it would be possible at any point in the parsing process to have a view of the JSON that "might eventually be produced", e.g. for updating a front-end rendering of the response in real time. For instance, in the case of a function call such as { "name": "offerSuggestions", "arguments": "[\"First\", \"Second\"]" }, it would be useful to have a view of the arguments array that at various points (optimistically) yields [], ["Fir"], ["First"], ["First", "Sec"] and ["First", "Second"] (depending on how the chunks fall).
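To make the problem concrete, here is a minimal sketch (not part of this PR) showing why plain JSON.parse cannot serve streamed chunks: every intermediate buffer is invalid JSON until the final chunk arrives, so a consumer gets no view of the in-progress value at all.

```javascript
// Accumulate the streamed chunks and attempt a full parse after each one.
const chunks = ['{ "con', "ten", 't": "', "Hi, h", "ow are", " you doi", "ng to", "day?", '" }'];

let buffer = "";
const parseAttempts = chunks.map((chunk) => {
  buffer += chunk;
  try {
    return { ok: true, value: JSON.parse(buffer) };
  } catch {
    return { ok: false };
  }
});

// Only the very last attempt succeeds; every intermediate buffer throws,
// which is exactly the gap optimistic parsing is meant to fill.
console.log(parseAttempts.filter((a) => a.ok).length); // 1
console.log(parseAttempts[parseAttempts.length - 1].value.content);
```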

Proposal

This PR introduces "optimistic" variants of the existing tokenizer, token parser and combination JSON parser. The design is as follows:

  • The optimistic tokenizer works just like a normal tokenizer, except that when it has finished emitting "complete" tokens in response to processing a provided piece of input (the argument to write), it may emit an additional "incomplete" token. Some examples:
write('{ "foo": "bar" }')
-> complete LEFT_BRACE, complete STRING[foo], complete COLON, complete STRING[bar], complete RIGHT_BRACE

write('{ "fo')
-> complete LEFT_BRACE, incomplete STRING[fo]

write('o')
-> incomplete STRING[foo]

write('": "')
-> complete STRING[foo], complete COLON, incomplete STRING[]
  • An optimistic token parser expects tokens that are either complete or incomplete. On encountering a complete token, the optimistic parser works as a normal parser. On encountering an incomplete token, the parser updates its internal state (e.g. adding a key to an object, or buffering a string), but does not advance state. This leaves the optimistic parser primed to accept more incomplete tokens or a complete token before actually moving on.
  • Additionally, the optimistic parser (as presented in this PR) does not utilise an onValue-style approach, instead offering the "top-level" value being parsed at all points (though this value will be mutated over the course of a parse). This allows consumers to write a chunk synchronously (as they do today) and observe the optimistic parse results directly.
  • The optimistic JSON parser glues together the optimistic tokenizer and parser in the expected way.
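The incomplete-token behaviour described above can be sketched in a few lines. This toy tokenizer handles only braces, colons and unescaped strings (no numbers, literals, escapes or commas), and the token shapes are illustrative rather than the PR's actual implementation:

```javascript
// Toy optimistic tokenizer: emits complete tokens as usual, then surfaces
// any partially-read string as an "incomplete" token at the end of write().
function tokenizeOptimistically(chunk, state = { inString: false, buffer: "" }) {
  const tokens = [];
  for (const ch of chunk) {
    if (state.inString) {
      if (ch === '"') {
        tokens.push({ type: "STRING", value: state.buffer, complete: true });
        state.inString = false;
        state.buffer = "";
      } else {
        state.buffer += ch;
      }
    } else if (ch === '"') {
      state.inString = true;
    } else if (ch === "{") {
      tokens.push({ type: "LEFT_BRACE", complete: true });
    } else if (ch === "}") {
      tokens.push({ type: "RIGHT_BRACE", complete: true });
    } else if (ch === ":") {
      tokens.push({ type: "COLON", complete: true });
    }
  }
  // The "optimistic" part: report the string we are still in the middle of.
  if (state.inString) {
    tokens.push({ type: "STRING", value: state.buffer, complete: false });
  }
  return { tokens, state };
}

const s1 = tokenizeOptimistically('{ "fo');
// -> complete LEFT_BRACE, incomplete STRING "fo"
const s2 = tokenizeOptimistically("o", s1.state);
// -> incomplete STRING "foo"
```

A downstream parser can then treat an incomplete STRING as "update the buffered value but do not advance", which is what primes it to accept either more incomplete tokens or the eventual complete one.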

Details and comments

I appreciate this PR may offer functionality that is deemed out of scope or not in line with the implementations currently offered by this library -- if there are changes or improvements that would alter this, please let me know! If not, thank you very much for the lovely library and providing such a great base to build these features (which we are already using) on!

  • I've extended the README to give a sense of the optimistic parsing support, but it might not be enough/might warrant its own piece of documentation.
  • This PR does not yet include a NodeJS implementation of these features, though if this design makes sense/can be moulded into something agreeable I don't think it would be too much effort to add these.
  • This PR includes a suite of tests for optimistic tokenizers and parsers, hopefully fairly in line with those already provided for features of the library.
  • Optimistic parsers are a bit weaker than normal parsers in that they don't have the paths and keepStack options/don't emit nested values. If this is useful/required these features could likely be added back in. For a first implementation I removed them to make it simpler to work out how optimistic parsing could work.

Thanks again for reading this and the library as it stands today!

This commit adds "optimistic" tokenizers and parsers to the `plainjs`
codebase. Optimistic tokenizers will emit "incomplete" tokens when they
think that they will eventually reach a state where a token can be
definitively produced. Optimistic parsers respond to incomplete tokens
and other states by presenting a view of the parsed value that is
expected to be in some sense a "subset" of the value that will
eventually be parsed (assuming, optimistically, that the stream will
constitute a valid value when all is said and done).
@juanjoDiaz
Owner

Thanks for the contribution @lunaris
I'll take a look at this as soon as possible.
Probably next week 🙂

@juanjoDiaz
Owner

Awesome work, @lunaris !

The switch block at the end of the Tokenizer is great. We can definitely keep that!
I would simply add it to the standard Tokenizer and add a boolean option emitPartialValues to use it or not.

For the TokenParser, we could use an approach more consistent with the current callback-based approach by having 2 callbacks: onValue and onPartialValue, with a similar emitPartialValues option.
That way we can keep things simple and keep the current paths and keepStack, while also emitting partial values.
The onPartialValue callback, like the onValue callback, will include the parent and the full stack.
That way, the user can always get the full in-progress object as is.

I'm sure that there will not be any performance impact on the Tokenizer.
I need to evaluate the performance impact on the TokenParser.
If the impact is too big, my alternative proposal would be to emit the partial values only from the Tokenizer and in the JsonParser enrich that with the stack information from the TokenParser.

I'll try to work on some code and share it here for us to discuss.

Wdyt?
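A self-contained sketch of the two-callback design described above. The names emitPartialValues, onValue and onPartialValue come from this comment; the toy parser below only handles a single `"key": "value"` pair and exists purely to show the callback shapes, not the library's real API:

```javascript
// Toy parser demonstrating the proposed emitPartialValues option with
// separate onValue / onPartialValue callbacks. The event shape (value,
// key, parent, stack) mirrors the existing onValue callback.
class ToyPartialParser {
  constructor({ emitPartialValues = false, onValue, onPartialValue } = {}) {
    this.emitPartialValues = emitPartialValues;
    this.onValue = onValue;
    this.onPartialValue = onPartialValue;
    this.buffer = "";
  }
  write(chunk) {
    this.buffer += chunk;
    // Match `"key": "value` and note whether the closing quote has arrived.
    const m = this.buffer.match(/"([^"]*)"\s*:\s*"([^"]*)("?)/);
    if (!m) return;
    const [, key, value, closed] = m;
    const event = { value, key, parent: {}, stack: [] };
    if (closed) this.onValue?.(event);
    else if (this.emitPartialValues) this.onPartialValue?.(event);
  }
}

const partials = [];
let final;
const p = new ToyPartialParser({
  emitPartialValues: true,
  onPartialValue: (e) => partials.push(e.value),
  onValue: (e) => (final = e.value),
});
p.write('{ "name": "Lon'); // partial value "Lon"
p.write('don" }');         // complete value "London"
```

Because both callbacks carry the parent and stack, a consumer could walk the stack to rebuild the full in-progress document on each partial event.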

case OptimisticTokenizerStates.NUMBER_AFTER_E_AND_DIGIT:
this.onToken({
token: TokenType.NUMBER,
value: this.parseNumber(this.bufferedNumber.toString()),
Owner

This is the only place where this could fail.

If the number is complex (-2, 2.3, 2e10, etc.), the parsing can fail.
I think that in that case we need to decide whether to not emit, emit with an undefined value, or emit with a string value.

Wdyt?

Author

If there's a chance it can fail, 100% agree. That said, is there that chance? In my head:

  • NUMBER_AFTER_INITIAL_ZERO -- the only buffer we can have here is "-0", I think, which will parse as 0
  • NUMBER_AFTER_INITIAL_NON_ZERO -- I think here we have buffers like -1, 2, which again will parse
  • NUMBER_AFTER_DECIMAL -- -3.1, 2.2, which will parse
  • NUMBER_AFTER_E_AND_DIGIT -- I think things like 2e4, which will parse

Wholly possible I'm missing cases that would fail though and it's cheap to add the guard either way. Let me know!
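A quick way to check the claim above is to run representative buffers for each state through Number() (which follows the same numeric grammar JSON.parse relies on). The state names refer to the tokenizer states discussed in this thread; the sample buffers are my own:

```javascript
// Representative buffers reachable in each numeric tokenizer state.
// All of them should convert cleanly, supporting the argument that the
// guard is cheap insurance rather than a known failure path.
const samples = {
  NUMBER_AFTER_INITIAL_ZERO: ["-0", "0"],
  NUMBER_AFTER_INITIAL_NON_ZERO: ["-1", "2", "23"],
  NUMBER_AFTER_DECIMAL: ["-3.1", "2.2"],
  NUMBER_AFTER_E_AND_DIGIT: ["2e4", "1.5e-2"],
};

for (const [state, buffers] of Object.entries(samples)) {
  for (const b of buffers) {
    if (Number.isNaN(Number(b))) {
      throw new Error(`${state}: buffer ${b} failed to parse`);
    }
  }
}
console.log("all numeric buffers parsed");
```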

@lunaris
Author

lunaris commented Sep 20, 2023

All sounds good to me, though I think if we have to look at the approach of doing it in JSONParser (or some equivalent), it might be that we need access to some bits of (currently) private parser state (or maybe not; I've not given it much thought). Hopefully it can be done directly in the parser without affecting performance 🤞

Thanks for the review and the response!

@juanjoDiaz
Owner

Closing as this was already released.

@juanjoDiaz juanjoDiaz closed this Jan 20, 2024