Improve Performance of JSON Reader #3441
Comments
Is the bottleneck writing the Parquet file or reading the JSON data? The JSON reader has not been heavily optimised yet. Perhaps you could use cargo-flamegraph or similar to confirm where the time is being spent? |
I noticed when running a 10-line extract of the JSONL file through both tools, ClickHouse called
$ sudo strace -wc \
target/release/json2parquet \
-c snappy \
../cali10.jsonl \
test.snappy.pq
For comparison, ClickHouse only called
$ sudo strace -wc \
clickhouse local \
--input-format JSONEachRow \
-q "SELECT *
FROM table
FORMAT Parquet" \
< ../cali10.jsonl \
> cali.snappy.pq
|
Could you try recompiling with arrow 30.0 to see if it has improved? |
Certainly. I'll report back with my findings. |
I could be mistaken, but strace only shows syscall latency, not time spent doing CPU-bound work. perf may give a more accurate picture. Perhaps something like
I would expect this task not to be all that IO-bound on modern hardware. |
The following was run on version 30.0.0.
$ tail -n6 ~/json2parquet/Cargo.toml
[dependencies]
parquet = "30.0.0"
arrow = "30.0.0"
arrow-schema = { version = "30.0.0", features = ["serde"] }
serde_json = "1.0.91"
clap = { version = "4.0.32", features = ["derive"] }
I ran this comparison again on a fresh 16-core
Below is the flamegraph of
$ echo -1 | sudo tee /proc/sys/kernel/perf_event_paranoid
$ git clone https://github.com/brendangregg/FlameGraph ~/FlameGraph
$ cd ~/json2parquet
$ RUSTFLAGS='-Ctarget-cpu=native' \
cargo build --release
$ sudo perf record \
--call-graph dwarf \
-- \
target/release/json2parquet \
-c snappy \
../California.jsonl \
test.snappy.pq
$ sudo perf script \
| ~/FlameGraph/stackcollapse-perf.pl \
> out.perf-folded
$ ~/FlameGraph/flamegraph.pl \
out.perf-folded \
> perf.svg |
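The `out.perf-folded` file written by `stackcollapse-perf.pl` holds one line per unique stack: semicolon-separated frames, a space, then a sample count. As a rough way to put a number on "where the time goes" without opening the SVG, the sketch below sums the samples whose stack mentions a given substring. The function name `share_of_samples` and the frame names in the usage note are hypothetical, not taken from the actual profile.

```rust
/// Returns the fraction of samples whose folded stack contains `needle`.
/// Each input line is expected to look like `main;foo;bar 42`.
/// Lines that do not end in an integer count are skipped.
fn share_of_samples(folded: &str, needle: &str) -> f64 {
    let mut total: u64 = 0;
    let mut matched: u64 = 0;
    for line in folded.lines() {
        // The count is whatever follows the last space; frame names may
        // themselves contain spaces, so split from the right.
        if let Some((stack, count)) = line.rsplit_once(' ') {
            if let Ok(count) = count.trim().parse::<u64>() {
                total += count;
                if stack.contains(needle) {
                    matched += count;
                }
            }
        }
    }
    if total == 0 {
        0.0
    } else {
        matched as f64 / total as f64
    }
}
```

For instance, reading the folded file into a string and calling `share_of_samples(&text, "alloc")` would estimate how much of the sampled time sits under allocation-related frames, assuming those frames carry "alloc" in their symbol names.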
Thank you. I have some ideas for how to improve the performance of JSON reading, and I'll write up a ticket over the coming days. In the short term you may be able to reduce the cost of |
For faster parsing of JSON, one improvement might be to avoid allocations when generating a |
Yeah, the reader is completely dominated by memory allocation. It should be possible to eliminate this just by using a custom |
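As a generic illustration of the allocation point (this is not the arrow JSON reader's code, and not necessarily the change being proposed in this thread), a line-oriented reader can reuse one buffer instead of allocating a fresh `String` for every record. The sketch below uses only the standard library; `count_nonempty_lines` is a hypothetical helper, and a real reader would parse each line rather than just count it.

```rust
use std::io::{self, BufRead};

/// Counts non-empty lines while reusing a single `String` buffer.
/// `BufRead::lines()` would allocate a new `String` per line instead.
fn count_nonempty_lines<R: BufRead>(mut reader: R) -> io::Result<usize> {
    let mut buf = String::new();
    let mut count = 0;
    loop {
        // `clear` resets the length but keeps the allocated capacity,
        // so later lines reuse the same heap memory.
        buf.clear();
        let bytes_read = reader.read_line(&mut buf)?;
        if bytes_read == 0 {
            break; // end of input
        }
        if !buf.trim().is_empty() {
            count += 1;
        }
    }
    Ok(count)
}
```

The same pattern applies to JSONL input read through a `BufReader<File>`: the buffer grows to the longest line seen and is then reused for the rest of the file.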
Sounds like a good idea to start with |
BTW -- #3479 is showing significant promise 🚀 |
Versions:
I downloaded the California dataset from https://github.com/microsoft/USBuildingFootprints and converted it from JSONL into Parquet with json2parquet and ClickHouse. I found json2parquet to be 1.5x slower than ClickHouse when it came to converting the records into Snappy-compressed Parquet.
I converted the original GeoJSON into JSONL with three elements per record. The resulting JSONL file is 3 GB uncompressed and has 11,542,912 lines.
I then converted that file into Snappy-compressed Parquet with ClickHouse which took 32 seconds and produced a file 793 MB in size.
The following was compiled with rustc 1.66.0 (69f9c33d7 2022-12-12).
The above took 43.8 seconds to convert the JSONL into PQ with a file 815 MB in size. There are 12 row groups in this PQ file.
The ClickHouse-produced PQ file has 306 row groups.
I'm not sure if the row group sizes played into the performance delta.
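For scale, the figures quoted above can be turned into a wall-clock ratio and an average row-group size. This is only a back-of-the-envelope sketch: it assumes each JSONL line becomes exactly one Parquet row, and the function names are hypothetical.

```rust
/// How many times longer `slow_secs` is than `fast_secs`.
fn slowdown(slow_secs: f64, fast_secs: f64) -> f64 {
    slow_secs / fast_secs
}

/// Average number of rows per row group, rounded down.
fn avg_rows_per_group(total_rows: u64, row_groups: u64) -> u64 {
    total_rows / row_groups
}
```

With the numbers from this report, `slowdown(43.8, 32.0)` is about 1.37, `avg_rows_per_group(11_542_912, 12)` is 961,909 rows per group for json2parquet, and `avg_rows_per_group(11_542_912, 306)` is 37,721 rows per group for ClickHouse, so the two writers are using very different row-group sizes.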
Is there anything I can do to my compilation settings to speed up Parquet generation?
I checked with the author of json2parquet, and he's certain there aren't issues within his code: domoritz/json2parquet#116