
Introduce support for optimistic parsing #36

Closed
wants to merge 1 commit

Conversation

lunaris

@lunaris lunaris commented Sep 15, 2023

Problem

Incremental parsing is super useful in the face of streamed data. A growing use case for streamed data that needs to be parsed is the output of AIs/chatbots such as OpenAI's GPT. The GPT message payload { "content": "Hi, how are you doing today?" } might be sent as the following chunks (indented to improve readability):

{ "con
      ten
         t": "
              Hi, h
                   ow are
                          you doi
                                 ng to
                                      day?
                                          " }

If using a GPT function call, in which one can ask GPT to indicate when it would like to call some specified function with a set of arguments, GPT may respond with a (streamed) JSON payload that additionally contains stringified JSON. The payload { "name": "getWeather", "arguments": "{ \"city\": \"London\" }" } might be chunked as follows:

{ "nam
      e" : "g
             etWeat
                   her", "ar
                            gume
                                nts": "{
                                         \"
                                           city\":
                                                   \"Lon
                                                        don\" }"
                                                                 }

As it stands, @streamparser/json allows one to feed a parser with chunks as they come, but it doesn't allow access to any intermediate state. This is discussed a bit in #31, and the rationale makes sense -- such intermediate state could involve "guessing" what might come next, or making assumptions about whether or not a full parse will eventually succeed. #31 provides a workaround, but (as far as I can tell) it does not work for arbitrarily-nested JSON substructures. Ideally, it would be possible at any point in the parsing process to have a view of the JSON that "might eventually be produced", e.g. for updating a front-end rendering of the response in real time. For instance, in the case of a function call such as { "name": "offerSuggestions", "arguments": "[\"First\", \"Second\"]" }, it would be useful to have a view of the arguments array that at various points (optimistically) yields [], ["Fir"], ["First"], ["First", "Sec"] and ["First", "Second"] (depending on how the chunks fall).
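To make the problem concrete, here is a minimal sketch (not part of this PR) showing why plain JSON.parse cannot serve streamed chunks: every intermediate buffer is invalid JSON until the final chunk arrives, so a consumer gets no view of the in-progress value at all.

```javascript
// Accumulate the streamed chunks and attempt a full parse after each one.
const chunks = ['{ "con', "ten", 't": "', "Hi, h", "ow are", " you doi", "ng to", "day?", '" }'];

let buffer = "";
const parseAttempts = chunks.map((chunk) => {
  buffer += chunk;
  try {
    return { ok: true, value: JSON.parse(buffer) };
  } catch {
    return { ok: false };
  }
});

// Only the very last attempt succeeds; every intermediate buffer throws,
// which is exactly the gap optimistic parsing is meant to fill.
console.log(parseAttempts.filter((a) => a.ok).length); // 1
console.log(parseAttempts[parseAttempts.length - 1].value.content);
```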

Proposal

This PR introduces "optimistic" variants of the existing tokenizer, token parser and combination JSON parser. The design is as follows:

  • The optimistic tokenizer works just like a normal tokenizer, except that when it has finished emitting "complete" tokens in response to processing a provided piece of input (the argument to write), it may emit an additional "incomplete" token. Some examples:
write('{ "foo": "bar" }')
-> complete LEFT_BRACE, complete STRING[foo], complete COLON, complete STRING[bar], complete RIGHT_BRACE

write('{ "fo')
-> complete LEFT_BRACE, incomplete STRING[fo]

write('o')
-> incomplete STRING[foo]

write('": "')
-> complete STRING[foo], complete COLON, incomplete STRING[]
  • An optimistic token parser expects tokens that are either complete or incomplete. On encountering a complete token, the optimistic parser works as a normal parser. On encountering an incomplete token, the parser updates its internal state (e.g. adding a key to an object, or buffering a string), but does not advance state. This leaves the optimistic parser primed to accept more incomplete tokens or a complete token before actually moving on.
  • Additionally, the optimistic parser (as presented in this PR) does not utilise an onValue-style approach, instead offering the "top-level" value being parsed at all points (though this value will be mutated over the course of a parse). This allows consumers to write a chunk synchronously (as they do today) and observe the optimistic parse results directly.
  • The optimistic JSON parser glues together the optimistic tokenizer and parser in the expected way.
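The incomplete-token behaviour described above can be sketched in a few lines. This toy tokenizer handles only braces, colons and unescaped strings (no numbers, literals, escapes or commas), and the token shapes are illustrative rather than the PR's actual implementation:

```javascript
// Toy optimistic tokenizer: emits complete tokens as usual, then surfaces
// any partially-read string as an "incomplete" token at the end of write().
function tokenizeOptimistically(chunk, state = { inString: false, buffer: "" }) {
  const tokens = [];
  for (const ch of chunk) {
    if (state.inString) {
      if (ch === '"') {
        tokens.push({ type: "STRING", value: state.buffer, complete: true });
        state.inString = false;
        state.buffer = "";
      } else {
        state.buffer += ch;
      }
    } else if (ch === '"') {
      state.inString = true;
    } else if (ch === "{") {
      tokens.push({ type: "LEFT_BRACE", complete: true });
    } else if (ch === "}") {
      tokens.push({ type: "RIGHT_BRACE", complete: true });
    } else if (ch === ":") {
      tokens.push({ type: "COLON", complete: true });
    }
  }
  // The "optimistic" part: report the string we are still in the middle of.
  if (state.inString) {
    tokens.push({ type: "STRING", value: state.buffer, complete: false });
  }
  return { tokens, state };
}

const s1 = tokenizeOptimistically('{ "fo');
// -> complete LEFT_BRACE, incomplete STRING "fo"
const s2 = tokenizeOptimistically("o", s1.state);
// -> incomplete STRING "foo"
```

A downstream parser can then treat an incomplete STRING as "update the buffered value but do not advance", which is what primes it to accept either more incomplete tokens or the eventual complete one.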

Details and comments

I appreciate this PR may offer functionality that is deemed out of scope or not in line with the implementations currently offered by this library -- if there are changes or improvements that would alter this, please let me know! If not, thank you very much for the lovely library and providing such a great base to build these features (which we are already using) on!

  • I've extended the README to give a sense of the optimistic parsing support, but it might not be enough/might warrant its own piece of documentation.
  • This PR does not yet include a NodeJS implementation of these features, though if this design makes sense/can be moulded into something agreeable I don't think it would be too much effort to add these.
  • This PR includes a suite of tests for optimistic tokenizers and parsers, hopefully fairly in line with those already provided for features of the library.
  • Optimistic parsers are a bit weaker than normal parsers in that they don't have the paths and keepStack options/don't emit nested values. If this is useful/required these features could likely be added back in. For a first implementation I removed them to make it simpler to work out how optimistic parsing could work.

Thanks again for reading this and the library as it stands today!

This commit adds "optimistic" tokenizers and parsers to the `plainjs`
codebase. Optimistic tokenizers will emit "incomplete" tokens when they
think that they will eventually reach a state where a token can be
definitively produced. Optimistic parsers respond to incomplete tokens
and other states by presenting a view of the parsed value that is
expected to be in some sense a "subset" of the value that will
eventually be parsed (assuming, optimistically, that the stream will
constitute a valid value when all is said and done).
@juanjoDiaz
Owner

Thanks for the contribution @lunaris
I'll take a look at this as soon as possible.
Probably next week 🙂

@juanjoDiaz
Owner

Awesome work, @lunaris !

The switch block at the end of the Tokenizer is great. We can definitely keep that!
I would simply add it to the standard Tokenizer and add a boolean option emitPartialValues to use it or not.

For the TokenParser, we could use an approach more consistent with the current callback-based approach by having 2 callbacks: onValue and onPartialValue, with a similar emitPartialValues option.
That way we can keep things simple and keep the current paths and keepStack, while also emitting partial values.
The onPartialValue callback, like the onValue callback, will include the parent and the full stack.
That way, the user can always get the full in-progress object as is.

I'm sure that there will not be any performance impact on the Tokenizer.
I need to evaluate the performance impact on the TokenParser.
If the impact is too big, my alternative proposal would be to emit the partial values only from the Tokenizer and in the JsonParser enrich that with the stack information from the TokenParser.

I'll try to work on some code and share it here for us to discuss.

Wdyt?
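A self-contained sketch of the two-callback design described above. The names emitPartialValues, onValue and onPartialValue come from this comment; the toy parser below only handles a single `"key": "value"` pair and exists purely to show the callback shapes, not the library's real API:

```javascript
// Toy parser demonstrating the proposed emitPartialValues option with
// separate onValue / onPartialValue callbacks. The event shape (value,
// key, parent, stack) mirrors the existing onValue callback.
class ToyPartialParser {
  constructor({ emitPartialValues = false, onValue, onPartialValue } = {}) {
    this.emitPartialValues = emitPartialValues;
    this.onValue = onValue;
    this.onPartialValue = onPartialValue;
    this.buffer = "";
  }
  write(chunk) {
    this.buffer += chunk;
    // Match `"key": "value` and note whether the closing quote has arrived.
    const m = this.buffer.match(/"([^"]*)"\s*:\s*"([^"]*)("?)/);
    if (!m) return;
    const [, key, value, closed] = m;
    const event = { value, key, parent: {}, stack: [] };
    if (closed) this.onValue?.(event);
    else if (this.emitPartialValues) this.onPartialValue?.(event);
  }
}

const partials = [];
let final;
const p = new ToyPartialParser({
  emitPartialValues: true,
  onPartialValue: (e) => partials.push(e.value),
  onValue: (e) => (final = e.value),
});
p.write('{ "name": "Lon'); // partial value "Lon"
p.write('don" }');         // complete value "London"
```

Because both callbacks carry the parent and stack, a consumer could walk the stack to rebuild the full in-progress document on each partial event.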

case OptimisticTokenizerStates.NUMBER_AFTER_E_AND_DIGIT:
this.onToken({
token: TokenType.NUMBER,
value: this.parseNumber(this.bufferedNumber.toString()),
Owner

This is the only place where this could fail.

If the number is complex (-2, 2.3, 2e10, etc.), the parsing can fail.
I think that in that case we need to decide whether to not emit, emit with an undefined value, or emit with a string value.

Wdyt?

Author

If there's a chance it can fail, 100% agree. That said, is there that chance? In my head:

  • NUMBER_AFTER_INITIAL_ZERO -- the only buffer we can have here is "-0", I think, which will parse as 0
  • NUMBER_AFTER_INITIAL_NON_ZERO -- I think here we have buffers like -1, 2, which again will parse
  • NUMBER_AFTER_DECIMAL -- -3.1, 2.2, which will parse
  • NUMBER_AFTER_E_AND_DIGIT -- I think things like 2e4, which will parse

Wholly possible I'm missing cases that would fail though and it's cheap to add the guard either way. Let me know!
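A quick way to check the claim above is to run representative buffers for each state through Number() (which follows the same numeric grammar JSON.parse relies on). The state names refer to the tokenizer states discussed in this thread; the sample buffers are my own:

```javascript
// Representative buffers reachable in each numeric tokenizer state.
// All of them should convert cleanly, supporting the argument that the
// guard is cheap insurance rather than a known failure path.
const samples = {
  NUMBER_AFTER_INITIAL_ZERO: ["-0", "0"],
  NUMBER_AFTER_INITIAL_NON_ZERO: ["-1", "2", "23"],
  NUMBER_AFTER_DECIMAL: ["-3.1", "2.2"],
  NUMBER_AFTER_E_AND_DIGIT: ["2e4", "1.5e-2"],
};

for (const [state, buffers] of Object.entries(samples)) {
  for (const b of buffers) {
    if (Number.isNaN(Number(b))) {
      throw new Error(`${state}: buffer ${b} failed to parse`);
    }
  }
}
console.log("all numeric buffers parsed");
```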

@lunaris
Author

lunaris commented Sep 20, 2023

All sounds good to me, though I think if we have to look at the approach of doing it in JSONParser (or some equivalent), it might be that we need access to some bits of (currently) private parser state (or maybe not; I've not given it much thought). Hopefully it can be done directly in the parser without affecting performance 🤞

Thanks for the review and the response!

@juanjoDiaz
Owner

Closing as this was already released.

@juanjoDiaz juanjoDiaz closed this Jan 20, 2024