Get an iterator of results #243

Open
charles-paperman opened this issue Sep 4, 2023 · 3 comments
Labels
acceptance: needs design (Sounds good, but needs exploration and prototyping); area: result (Improvements in query result reporting); help wanted (External contributions welcome); type: feature (New feature or request)
Milestone

Comments

@charles-paperman
Collaborator

It would be nice to have an iterator over results, so that we can post-filter and/or process each match with potentially slow external code without loading all matches into RAM simultaneously.

Example use case: a JSON document containing a very large list of objects. We filter them with JSONPath and obtain a sublist that has to be inserted into a DB. Loading them all into RAM is not possible (the sublist is potentially too big), so we want to perform a slow operation on each match and free it from memory afterwards.

@charles-paperman charles-paperman added the type: feature New feature or request label Sep 4, 2023
@github-actions github-actions bot added the acceptance: triage Waiting for owner's input label Sep 4, 2023
@github-actions

github-actions bot commented Sep 4, 2023

Tagging @V0ldek for notifications

@V0ldek
Member

V0ldek commented Sep 6, 2023

This is a really big feature.

The current engine does not support pausing/resuming. It also doesn't play well with the current architecture of Engine-Recorder-Sink – the Recorder would have to pause the engine? There are eight different places where a match might be reported in the current main engine, and more if we count head skipping. All of those would have to be augmented to save the state of the engine and return to the caller. To add to the pain, in the general case the NodesRecorder performs reordering of results and so it doesn't report the matches immediately – it batches them on the stack and then can report many of them at the same time.

Not saying this is impossible, but it would almost certainly be an entirely new engine. In particular I suspect that simply adding this capability to the main engine would screw with SIMD code generation, even if the caller intended to consume the entire iterator immediately anyway.

If the concern here is memory consumption then there is a workaround with multithreading. You can spin up one thread for the engine and another one as the consumer, and as the sink pass a wrapper around a bounded-capacity queue/channel (e.g. crossbeam's ArrayQueue). That way you limit the RAM usage, and the consumer can expose an iterator API which internally reads from the queue/channel.
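To make the workaround concrete, here is a minimal sketch using only the standard library. It swaps in `std::sync::mpsc::sync_channel` as the bounded channel instead of crossbeam's ArrayQueue. The `MatchSink` trait, `ChannelSink`, `run_fake_engine` and `stream_matches` are all hypothetical names made up for this illustration; they stand in for the library's sink abstraction and engine and are not the actual rsonpath API.

```rust
use std::sync::mpsc::{sync_channel, SendError, SyncSender};
use std::thread;

/// Hypothetical stand-in for a sink abstraction: the engine hands every
/// match to `add_match`. This is NOT the real rsonpath trait.
trait MatchSink<T> {
    type Error;
    fn add_match(&mut self, m: T) -> Result<(), Self::Error>;
}

/// Sink that forwards each match into a bounded channel.
struct ChannelSink<T> {
    tx: SyncSender<T>,
}

impl<T> MatchSink<T> for ChannelSink<T> {
    type Error = SendError<T>;

    fn add_match(&mut self, m: T) -> Result<(), Self::Error> {
        // Blocks while the channel is full, which is what bounds memory use.
        // Returns an error once the receiving side has been dropped.
        self.tx.send(m)
    }
}

/// Hypothetical engine: reports every item as a "match" to the sink.
fn run_fake_engine<S: MatchSink<String>>(
    items: Vec<String>,
    sink: &mut S,
) -> Result<(), S::Error> {
    for item in items {
        sink.add_match(item)?;
    }
    Ok(())
}

/// Runs the engine on its own thread and exposes the matches as an iterator.
/// At most `capacity` matches are buffered in the channel at any time.
fn stream_matches(items: Vec<String>, capacity: usize) -> impl Iterator<Item = String> {
    let (tx, rx) = sync_channel(capacity);
    thread::spawn(move || {
        let mut sink = ChannelSink { tx };
        // If the consumer drops the iterator early, `send` fails and the
        // engine stops; the error is deliberately ignored here.
        let _ = run_fake_engine(items, &mut sink);
        // `sink` (and with it `tx`) is dropped here, which ends the iterator.
    });
    rx.into_iter()
}
```

The consumer then simply writes `for m in stream_matches(items, 16) { /* slow work */ }`. Each match is moved out of the channel and freed once the loop body finishes with it, while the engine thread blocks whenever it gets `capacity` matches ahead of the consumer.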

I am going to file this into the "Future" category. We could explore adding multithreaded Sink support (maybe even an async one) earlier (the 1.1.0 target) – if that sounds appealing please let me know. I feel like a Sink impl for a channel would be useful, but it depends on whether multithreading is even an acceptable solution for the user.

@V0ldek V0ldek added this to the Future milestone Sep 6, 2023
@github-actions github-actions bot added acceptance: go ahead Reviewed, implementation can start and removed acceptance: triage Waiting for owner's input labels Sep 6, 2023
@V0ldek V0ldek added acceptance: needs design Sounds good, but needs exploration and prototyping help wanted External contributions welcome mod: engine area: result Improvements in query result reporting and removed acceptance: go ahead Reviewed, implementation can start labels Sep 6, 2023
@charles-paperman
Collaborator Author

I think iterating through high-level events would also allow a SAX-style API. That would be really nice for applications that need the underlying classifiers but not the query compilation. I suspect this is hard to do, but it would allow building efficient validation, as in simdjson.

The automata construction could then be built on top of that API. Adding iterators would then simply mean changing this interface and the result reporting code.
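As a rough illustration of what consuming such an event stream could look like, here is a small sketch. The `Event` enum and `is_well_nested` are hypothetical and invented for this example; nothing like them exists in the library today, and a real event type would carry far more detail (keys, byte offsets, value kinds).

```rust
/// Hypothetical high-level events that a SAX-style layer might emit.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Event {
    StartObject,
    EndObject,
    StartArray,
    EndArray,
    Value,
}

/// Checks that objects and arrays are properly nested, pulling one event
/// at a time so that the whole document never has to be held in memory.
fn is_well_nested(events: impl IntoIterator<Item = Event>) -> bool {
    // Stack of currently open containers.
    let mut stack = Vec::new();
    for ev in events {
        match ev {
            Event::StartObject | Event::StartArray => stack.push(ev),
            Event::EndObject => {
                if stack.pop() != Some(Event::StartObject) {
                    return false;
                }
            }
            Event::EndArray => {
                if stack.pop() != Some(Event::StartArray) {
                    return false;
                }
            }
            Event::Value => {}
        }
    }
    // Every opened container must have been closed.
    stack.is_empty()
}
```

A validator like this only needs the iterator interface, which is the point of the proposal: query execution would become one consumer of the event stream among others.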

@github-project-automation github-project-automation bot moved this from Todo to Merged in Active rsonpath development Sep 7, 2023
@github-project-automation github-project-automation bot moved this from Merged to Committed in Active rsonpath development Sep 7, 2023
@github-actions github-actions bot added the acceptance: triage Waiting for owner's input label Sep 7, 2023
@V0ldek V0ldek removed the acceptance: triage Waiting for owner's input label Sep 7, 2023
@V0ldek V0ldek removed the mod: engine label Oct 4, 2023
@V0ldek V0ldek moved this from Committed to Todo in Active rsonpath development Oct 30, 2023

2 participants