Improve Performance of JSON Reader #3441

Closed · marklit opened this issue Jan 3, 2023 · 11 comments · Fixed by #3479

Assignees: tustvold
Labels: arrow (Changes to the arrow crate), enhancement (Any new improvement worthy of an entry in the changelog)

Comments


marklit commented Jan 3, 2023

Versions:

  • json2parquet 0.6.0 with the following Cargo packages:
    • parquet = "29.0.0" (this is in the main branch but the file metadata states 23.0.0 for some reason)
    • arrow = "29.0.0"
    • arrow-schema = { version = "29.0.0", features = ["serde"] }
  • PyArrow 10.0.1
  • ClickHouse 22.13.1.1119

I downloaded the California dataset from https://github.com/microsoft/USBuildingFootprints and converted it from JSONL into Parquet with json2parquet and ClickHouse. I found json2parquet to be 1.5x slower than ClickHouse when it came to converting the records into Snappy-compressed Parquet.

I converted the original GeoJSON into JSONL with three fields per record. The resulting JSONL file is 3 GB uncompressed and has 11,542,912 lines.

$ ogr2ogr -f GeoJSONSeq /vsistdout/ California.geojson \
    | jq -c '.properties * {geom: .geometry|tostring}' \
    > California.jsonl
$ head -n1 California.jsonl | jq .
{
  "release": 1,
  "capture_dates_range": "",
  "geom": "{\"type\":\"Polygon\",\"coordinates\":[[[-114.127454,34.265674],[-114.127476,34.265839],[-114.127588,34.265829],[-114.127565,34.265663],[-114.127454,34.265674]]]}"
}

I then converted that file into Snappy-compressed Parquet with ClickHouse, which took 32 seconds and produced a 793 MB file.

$ cat California.jsonl \
    | clickhouse local \
          --input-format JSONEachRow \
          -q "SELECT *
              FROM table
              FORMAT Parquet" \
    > cali.snappy.pq

The following was compiled with rustc 1.66.0 (69f9c33d7 2022-12-12).

$ git clone https://github.com/domoritz/json2parquet/
$ cd json2parquet
$ RUSTFLAGS='-Ctarget-cpu=native' cargo build --release
$ /usr/bin/time -al \
        target/release/json2parquet \
        -c snappy \
        California.jsonl \
        California.snappy.pq

The above took 43.8 seconds to convert the JSONL into Parquet and produced an 815 MB file containing 12 row groups.

In [1]: import pyarrow.parquet as pq

In [2]: pf = pq.ParquetFile('California.snappy.pq')

In [3]: pf.schema
Out[3]: 
<pyarrow._parquet.ParquetSchema object at 0x109a11380>
required group field_id=-1 arrow_schema {
  optional binary field_id=-1 capture_dates_range (String);
  optional binary field_id=-1 geom (String);
  optional int64 field_id=-1 release;
}

In [4]: pf.metadata
Out[4]: 
<pyarrow._parquet.FileMetaData object at 0x10adf09f0>
  created_by: parquet-rs version 29.0.0
  num_columns: 3
  num_rows: 11542912
  num_row_groups: 12
  format_version: 1.0
  serialized_size: 7969

The ClickHouse-produced PQ file has 306 row groups.

In [1]: pf = pq.ParquetFile('cali.snappy.pq')

In [2]: pf.schema
Out[2]: 
<pyarrow._parquet.ParquetSchema object at 0x105ccc940>
required group field_id=-1 schema {
  optional int64 field_id=-1 release;
  optional binary field_id=-1 capture_dates_range;
  optional binary field_id=-1 geom;
}

In [3]: pf.metadata
Out[3]: 
<pyarrow._parquet.FileMetaData object at 0x1076705e0>
  created_by: parquet-cpp version 1.5.1-SNAPSHOT
  num_columns: 3
  num_rows: 11542912
  num_row_groups: 306
  format_version: 1.0
  serialized_size: 228389

I'm not sure if the row group sizes played into the performance delta.
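(For reference, the writer's row-group size in parquet-rs is configurable through WriterProperties, and 12 row groups over ~11.5M rows appears consistent with the crate's default of 1024 * 1024 rows per group. A minimal sketch of that knob, using a hypothetical write_parquet helper rather than anything json2parquet exposes:)

use std::fs::File;
use std::sync::Arc;

use arrow::datatypes::Schema;
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::basic::Compression;
use parquet::file::properties::WriterProperties;

// Hypothetical helper: write `batches` as Snappy-compressed Parquet,
// capping each row group at `rows_per_group` rows.
fn write_parquet(
    path: &str,
    schema: Arc<Schema>,
    batches: &[RecordBatch],
    rows_per_group: usize,
) -> Result<(), Box<dyn std::error::Error>> {
    let props = WriterProperties::builder()
        .set_compression(Compression::SNAPPY)
        .set_max_row_group_size(rows_per_group)
        .build();
    let mut writer = ArrowWriter::try_new(File::create(path)?, schema, Some(props))?;
    for batch in batches {
        writer.write(batch)?;
    }
    writer.close()?;
    Ok(())
}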

Is there anything I can do to my compilation settings to speed up Parquet generation?

I checked with the author of json2parquet and he's certain the issue isn't within his code: domoritz/json2parquet#116


tustvold commented Jan 4, 2023

Is the bottleneck writing the Parquet file or reading the JSON data? The JSON reader has not been heavily optimised yet. Perhaps you could use cargo-flamegraph or similar to confirm where the time is being spent?


marklit commented Jan 4, 2023

When running a 10-line extract of the JSONL file through both tools, I noticed that json2parquet called write 434 times (61% of its syscall time). read accounted for only 2% of the time, which suggests to me that most of the time is spent writing Parquet rather than reading JSONL.

$ sudo strace -wc \
    target/release/json2parquet \
    -c snappy \
    ../cali10.jsonl \
    test.snappy.pq
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 61.30    0.007141          16       434           write
 21.66    0.002523         280         9           openat
  4.33    0.000504          19        26           mmap
  2.73    0.000318         317         1           execve
  2.09    0.000244          16        15           read
...

For comparison, ClickHouse only called write twice with this workload.

$ sudo strace -wc \
    clickhouse local \
          --input-format JSONEachRow \
          -q "SELECT *
              FROM table
              FORMAT Parquet" \
    < ../cali10.jsonl \
    > cali.snappy.pq
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 29.52    0.019018        1584        12           futex
 21.15    0.013625          63       214           gettid
 11.19    0.007209         514        14           mprotect
 11.06    0.007123         791         9         4 stat
  8.72    0.005617         108        52           close
  5.16    0.003327        1109         3           poll
  2.19    0.001412          23        60           mmap
  2.09    0.001344          39        34         1 openat
  1.27    0.000816          18        44           read
...
  0.15    0.000098          48         2           write
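(Purely as an illustration of what those write counts measure: the number of write(2) calls mostly reflects how the output is buffered rather than how much work is done. A minimal, generic sketch of unbuffered versus buffered output in Rust, not a claim about how json2parquet or parquet-rs writes:)

use std::fs::File;
use std::io::{BufWriter, Write};

fn main() -> std::io::Result<()> {
    // Unbuffered: every write_all() below becomes its own write(2) syscall.
    let mut raw = File::create("unbuffered.out")?;
    for i in 0..1000 {
        raw.write_all(format!("{i}\n").as_bytes())?;
    }

    // Buffered: writes are coalesced in an in-memory buffer (8 KiB by default)
    // and write(2) is only issued when the buffer fills or on flush/drop.
    let mut buffered = BufWriter::new(File::create("buffered.out")?);
    for i in 0..1000 {
        buffered.write_all(format!("{i}\n").as_bytes())?;
    }
    buffered.flush()?;
    Ok(())
}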


Dandandan commented Jan 4, 2023

Could you try recompiling with arrow 30.0 to see if it has improved?


marklit commented Jan 4, 2023

Certainly. I'll report back with my findings.


tustvold commented Jan 4, 2023

I could be mistaken, but strace only shows syscall latency, not any time spent doing CPU-bound work. perf may give a more accurate picture; perhaps something like perf record --call-graph dwarf -- <COMMAND>?

I would expect this task not to be all that IO-bound on modern hardware.


marklit commented Jan 4, 2023

The following was run on version 30.0.0.

$ tail -n6 ~/json2parquet/Cargo.toml
[dependencies]
parquet = "30.0.0"
arrow = "30.0.0"
arrow-schema = { version = "30.0.0", features = ["serde"] }
serde_json = "1.0.91"
clap = { version = "4.0.32", features = ["derive"] }

I ran this comparison again on a fresh 16-core e2-highcpu-16 VM on GCP. ClickHouse took 13.7 seconds and json2parquet took 56.7 seconds to process the 11,542,912-row JSONL file. I noticed json2parquet was maxing out a single core while ClickHouse managed to hit 40-180% across 4 cores according to htop.

Below is the flamegraph of json2parquet processing the 11,542,912-row JSONL file.

$ echo -1 | sudo tee /proc/sys/kernel/perf_event_paranoid

$ git clone https://github.com/brendangregg/FlameGraph ~/FlameGraph

$ cd ~/json2parquet

$ RUSTFLAGS='-Ctarget-cpu=native' \
        cargo build --release

$ sudo perf record \
    --call-graph dwarf \
    -- \
    target/release/json2parquet \
    -c snappy \
    ../California.jsonl \
    test.snappy.pq

$ sudo perf script \
    | ~/FlameGraph/stackcollapse-perf.pl \
    > out.perf-folded
$ ~/FlameGraph/flamegraph.pl \
    out.perf-folded \
    > perf.svg

[flamegraph: perf.svg]


tustvold commented Jan 4, 2023

Thank you. I have some ideas for how to improve the performance of JSON reading; I'll write up a ticket over the coming days.

In the short term you may be able to reduce the cost of infer_json_schema by limiting the number of records it reads, or even by providing the schema up-front.
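A minimal sketch of supplying the schema up front so the reader skips inference entirely; the field types come from the pyarrow schema dump above, and the builder calls follow the newer arrow-json API (the arrow 30 ReaderBuilder is configured differently, so treat the exact method names as an assumption):

use std::fs::File;
use std::io::BufReader;
use std::sync::Arc;

use arrow::datatypes::{DataType, Field, Schema};
use arrow::json::ReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Schema written out by hand, so the reader never scans the file to infer it.
    let schema = Arc::new(Schema::new(vec![
        Field::new("release", DataType::Int64, true),
        Field::new("capture_dates_range", DataType::Utf8, true),
        Field::new("geom", DataType::Utf8, true),
    ]));

    let input = BufReader::new(File::open("California.jsonl")?);
    let reader = ReaderBuilder::new(schema)
        .with_batch_size(8192)
        .build(input)?;

    for batch in reader {
        let batch = batch?;
        // Each RecordBatch would be handed to the Parquet writer here.
        let _ = batch.num_rows();
    }
    Ok(())
}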

Dandandan commented Jan 4, 2023

For faster parsing of JSON, one improvement might be to avoid allocations when generating a Value. simd-json has some options to do this (and also, AFAIK, to allow for buffer reuse): https://docs.rs/simd-json/latest/simd_json/value/borrowed/index.html
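A minimal sketch of the borrowed-value API, illustrative only (the record mirrors the dataset above):

use simd_json::value::borrowed::Value;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // simd-json parses in place: string fields of the resulting borrowed Value
    // point back into `buf` rather than into newly allocated Strings, which is
    // why the input buffer has to be mutable.
    let mut buf = br#"{"release":1,"capture_dates_range":"","geom":"{}"}"#.to_vec();
    let parsed: Value = simd_json::to_borrowed_value(&mut buf)?;

    if let Value::Object(obj) = &parsed {
        println!("release = {:?}", obj.get("release"));
    }
    Ok(())
}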


tustvold commented Jan 4, 2023

Yeah, the reader is completely dominated by memory allocation; it should be possible to eliminate this just by using a custom Value. Switching to simd-json is also an option, but is less of an invisible change 😄

Dandandan commented Jan 4, 2023

> Yeah, the reader is completely dominated by memory allocation; it should be possible to eliminate this just by using a custom Value. Switching to simd-json is also an option, but is less of an invisible change 😄

Sounds like a good idea to start with serde 😜
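(For reference, serde itself can also avoid most per-field allocations via borrowed deserialization into Cow fields; a minimal illustrative sketch, not a description of what arrow-json currently does:)

use std::borrow::Cow;

use serde::Deserialize;

// Borrowing deserialization: Cow fields point into the input line when no
// escape sequences are present, and only allocate when unescaping is needed.
#[derive(Debug, Deserialize)]
struct Row<'a> {
    release: i64,
    #[serde(borrow)]
    capture_dates_range: Cow<'a, str>,
    #[serde(borrow)]
    geom: Cow<'a, str>,
}

fn main() -> serde_json::Result<()> {
    let line = r#"{"release":1,"capture_dates_range":"","geom":"{}"}"#;
    let row: Row = serde_json::from_str(line)?;
    println!("{row:?}");
    Ok(())
}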

tustvold changed the title from "Performance improvement for compressing Snappy-compressed Parquet files?" to "Improve Performance of JSON Reader" Jan 6, 2023
tustvold self-assigned this Jan 6, 2023
tustvold added the enhancement label and removed the question label Jan 6, 2023

alamb commented Jan 25, 2023

BTW -- #3479 is showing significant promise 🚀

tustvold added the arrow label Jan 30, 2023