Add Raw JSON Reader (~2.5x faster) #3479
Conversation
My basic plan is to do something similar to https://github.com/simdjson/simdjson/blob/master/doc/tape.md. There is a Rust implementation of simd-json, but it is a bit heavyweight for our needs and is a fairly substantial dependency, so I'd like to try doing something simpler 😅 Not sure when exactly I'll get around to doing this, perhaps in a week's time.
@tustvold I recently saw https://github.com/PSeitz/serde_json_borrow come by - we might take some inspiration from there, or potentially use that crate?
Thanks for the link. I'm actually part way through implementing a mechanism that decodes directly to arrow - I think this should give us the best possible performance. Just need to find some focus time to get it over the line.
Cool, happy to learn about the results / do a review!
It needs some more cleanup, and I've not really spent any time trying to optimize it, but getting a more respectable 2.5x performance improvement with the new approach, albeit for a fair amount of additional code complexity...
arrow-json/src/raw/mod.rs (outdated)

trait ArrayDecoder {
This approach is based on what we do for parquet, and will more naturally generalize to support arbitrarily nested data than the current implementation
/// Ok(std::iter::from_fn(move || next().transpose()))
/// }
/// ```
pub fn decode(&mut self, buf: &[u8]) -> Result<usize, ArrowError> {
I'm pretty chuffed with this interface; it should allow for streaming decode from object storage without having to first delimit rows, which I think is pretty cool.
I think we should explicitly mention this use case in the doc comments somewhere to help others discover/understand why they might want to use this interface -- probably in the main struct doc comments. I know you say "facilitating integration with arbitrary byte streams," but I am thinking something very direct like:
"This interface allows streaming decode. For example, it can decode a stream of bytes directly from object storage without having to first delimit rows"
I am happy that this is now ready for review. Whilst it is a fair amount of complexity, I think the 2.5x performance improvement justifies it. Furthermore, for async workloads the benefit will be even more pronounced, as it avoids having to perform a pre-parse to delimit newlines: data can be streamed directly from object storage and fed into the decoder. My plan is to get this integrated into DataFusion, fix the inevitable fallout, and then deprecate the old reader.
Whilst writing some integration tests comparing RawReader against the existing reader, I found some divergence:
I'm not sure if this is behaviour we wish to replicate; I at least found it very surprising. Edit: in fact the list promotion logic is currently broken on master - #3601
All in all, I love this change. 🏆 Thank you @tustvold.
I went through the code and tests carefully. I have some suggestions, but I don't think any are strictly required to merge this. The most important thing, I think, is some more tests, especially focused on error cases, as I mentioned inline.
Stepping back, I actually think this is a quite important feature for arrow-rs
and will serve us well. I imagine we can write up a great post about "how we made JSON decoding 2.5x faster" -- aka "look at this shiny JSON reader we have, you should try it out, and while you are here...." 🚀
Question: Why `raw` for a name?
Maybe this is moot given the next question, but I didn't understand "raw" -- some other possibly better names: "v2", "fast", "direct".
Question: Why keep both json readers?
So I wonder why keep both the original json reader https://docs.rs/arrow-json/31.0.0/arrow_json/reader/struct.Reader.html and this one?
Given the compelling performance improvements, it seems like we should simply switch to use the raw decoder and remove the existing one. This would
- improve user performance
- reduce our maintenance burden
- make the crate easier to use (no need to pick which decoder is desired)
- Ensure this reader passes all the same tests, etc.
If we are thinking about a migration strategy, perhaps it could be like:
- Release the raw reader in arrow next
- Switch the default json reader to the raw reader in arrow next+1 (but keep the old reader around for another release)
- Remove the old reader in arrow next+2
Suggestion for (even more) tests
It would be awesome to get some sort of larger test corpus for this decoder. I wonder if there is some way to reuse the test suite in simdjson or similar 🤔
/// A tape encoding inspired by [simdjson]
///
/// Uses `u32` for offsets to ensure `TapeElement` is 64-bits. A future
/// iteration may increase this to a custom `u56` type.
I think it would be valuable to inline as much of https://github.com/simdjson/simdjson/blob/master/doc/tape.md as is relevant to this implementation to document the tape format (maybe copy in tape.md from simdjson, update it as appropriate, and keep a pointer back to the original).
Reasons:
- It would reduce people's questions about "what is different" (if this one is only "inspired")
- It would allow doc updates to this format along with code updates
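For context on the 64-bit point in the doc comment above, here's a toy illustration (not the PR's actual `TapeElement` definition) of why `u32` offsets keep each element to a single word:

```rust
/// Toy stand-in for a tape element: a tag plus a u32 payload packs into
/// 8 bytes, whereas a u64 payload would double the size of every entry
#[derive(Copy, Clone, Debug)]
enum ToyTapeElement {
    /// Index of the matching end element
    StartObject(u32),
    /// Index of the matching start element
    EndObject(u32),
    /// Offset into a separate string buffer
    String(u32),
    Null,
}

fn main() {
    // The whole element fits in 64 bits
    assert_eq!(std::mem::size_of::<ToyTapeElement>(), 8);
}
```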
{"a": "b", "object": {"nested": "hello", "foo": 23}, "b": {}, "c": {"foo": null }} | ||
|
||
{"a": ["", "foo", ["bar", "c"]], "b": {"1": []}, "c": {"2": [1, 2, 3]} } |
👍 I double checked this contains lists of objects 👍
arrow-json/src/raw/mod.rs (outdated)
/// A [`RecordBatchReader`] that reads newline-delimited JSON data with a known schema
/// directly into the corresponding arrow arrays
///
/// This makes it significantly faster than [`Reader`]
It would help here to comment / explain to readers how to pick which reader to use. See my main PR review comments.
arrow-json/src/raw/mod.rs (outdated)
/// Create a [`RawDecoder`] with the provided schema and batch size
pub fn try_new(schema: SchemaRef, batch_size: usize) -> Result<Self, ArrowError> {
    let decoder = make_decoder(DataType::Struct(schema.fields.clone()), false)?;
    // TODO: This should probably include nested fields
is this still a todo? It seems like this is just an optimization to get the initial capacity sizing correct, not a correctness issue (it might help to make that clear)
assert_eq!(c.values(), &[3, 4]);
}

#[test]
Tests I think would be good that I didn't see are for error conditions (a rough sketch follows below):
- Send in non-UTF8 JSON
- Send in partially / truncated JSON (both the first object and also subsequent objects)
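A rough sketch of what such tests could look like (the import path, the error behaviour for invalid UTF-8, and whether truncation only surfaces on a later flush are all assumptions, not confirmed by this PR):

```rust
use std::sync::Arc;

use arrow_json::RawDecoder; // assumed export path
use arrow_schema::{DataType, Field, Schema};

#[test]
fn test_decode_rejects_invalid_utf8() {
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Utf8, true)]));
    let mut decoder = RawDecoder::try_new(schema, 64).unwrap();
    // 0xFF is never valid UTF-8, so this should surface an error (assumption)
    assert!(decoder.decode(b"{\"a\": \"\xFF\"}").is_err());
}

#[test]
fn test_decode_buffers_truncated_record() {
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int64, true)]));
    let mut decoder = RawDecoder::try_new(schema, 64).unwrap();
    // A partial record should just be buffered by `decode`; the error (if any)
    // would presumably only show up when the decoder is finalized / flushed
    decoder.decode(b"{\"a\": 1").unwrap();
}
```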
}

trait ArrayDecoder: Send {
    fn decode(&mut self, tape: &Tape<'_>, pos: &[u32]) -> Result<ArrayData, ArrowError>;
I think it would help to document what the expected values in `pos` are (indexes into the tape of starting elements?). Perhaps something like:
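(Hypothetical wording based on my reading of the code, applied to the signature quoted above:)

```rust
    /// Decode the rows whose values start at the tape indices given in `pos`
    ///
    /// Each entry in `pos` is assumed to be the index of the `TapeElement` at
    /// which one row's value for this column begins (a null, a scalar, or the
    /// start element of a nested object / list)
    fn decode(&mut self, tape: &Tape<'_>, pos: &[u32]) -> Result<ArrayData, ArrowError>;
```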
}

if self.offsets.len() >= u32::MAX as usize {
    return Err(ArrowError::JsonError(format!("Encountered more than {} bytes of string data, consider using a smaller batch size", u32::MAX)));
this would be a good condition to cover if possible 🤔
I think this should also have some tests for error cases:
- utf8 encoded data
- Invalid / corrupt utf8 data
- Truncated data (like a string that `"ends in the middle of a`)
It's hopefully temporary, but to allow for a grace period where both readers are supported
This is exactly what I intend to do 👍 As the current one exposes
It is probably good to file a ticket with this overall plan to make it clearer -- I can do so if you would like
}
TapeElement::Number(idx) => {
    let s = tape.get_string(idx);
    let value = lexical_core::parse::<f64>(s.as_bytes())
Is this faster than std? AFAIK the std `parse` should be about as fast as lexical_core by now.
It was faster when I benchmarked it, yes
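For anyone curious, a quick-and-dirty way to compare the two (not the benchmark actually run here; numbers will vary with input distribution and build settings):

```rust
use std::time::Instant;

fn main() {
    // One million simple decimal strings; real JSON numbers will vary more
    let inputs: Vec<String> = (0..1_000_000).map(|i| format!("{}.25", i)).collect();

    let start = Instant::now();
    let total: f64 = inputs
        .iter()
        .map(|s| lexical_core::parse::<f64>(s.as_bytes()).unwrap())
        .sum();
    println!("lexical_core: {:?} (sum = {})", start.elapsed(), total);

    let start = Instant::now();
    let total: f64 = inputs.iter().map(|s| s.parse::<f64>().unwrap()).sum();
    println!("std parse:    {:?} (sum = {})", start.elapsed(), total);
}
```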
Benchmark runs are scheduled for baseline = 902a17d and contender = 0f1a92a. 0f1a92a is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?
Closes #3441
Rationale for this change
This adds a new JSON reader that reads directly into arrow arrays. This leads to non-trivial performance improvements vs the current `serde_json::Value` approach, whilst also, I think, making the logic for handling nested schemas easier to follow.
What changes are included in this PR?
Are there any user-facing changes?