Improve Performance of JSON Reader #3441

Closed · marklit opened this issue Jan 3, 2023 · 11 comments · Fixed by #3479

Assignees: tustvold
Labels: arrow (Changes to the arrow crate), enhancement (Any new improvement worthy of an entry in the changelog)

Comments


marklit commented Jan 3, 2023

Versions:

  • json2parquet 0.6.0 with the following Cargo packages:
    • parquet = "29.0.0" (this is in the main branch but the file metadata states 23.0.0 for some reason)
    • arrow = "29.0.0"
    • arrow-schema = { version = "29.0.0", features = ["serde"] }
  • PyArrow 10.0.1
  • ClickHouse 22.13.1.1119

I downloaded the California dataset from https://github.com/microsoft/USBuildingFootprints and converted it from JSONL into Parquet with json2parquet and ClickHouse. I found json2parquet to be 1.5x slower than ClickHouse when it came to converting the records into Snappy-compressed Parquet.

I converted the original GeoJSON into JSONL with three fields per record. The resulting JSONL file is 3 GB uncompressed and has 11,542,912 lines.

$ ogr2ogr -f GeoJSONSeq /vsistdout/ California.geojson \
    | jq -c '.properties * {geom: .geometry|tostring}' \
    > California.jsonl
$ head -n1 California.jsonl | jq .
{
  "release": 1,
  "capture_dates_range": "",
  "geom": "{\"type\":\"Polygon\",\"coordinates\":[[[-114.127454,34.265674],[-114.127476,34.265839],[-114.127588,34.265829],[-114.127565,34.265663],[-114.127454,34.265674]]]}"
}

I then converted that file into Snappy-compressed Parquet with ClickHouse, which took 32 seconds and produced a 793 MB file.

$ cat California.jsonl \
    | clickhouse local \
          --input-format JSONEachRow \
          -q "SELECT *
              FROM table
              FORMAT Parquet" \
    > cali.snappy.pq

The following was compiled with rustc 1.66.0 (69f9c33d7 2022-12-12).

$ git clone https://github.com/domoritz/json2parquet/
$ cd json2parquet
$ RUSTFLAGS='-Ctarget-cpu=native' cargo build --release
$ /usr/bin/time -al \
        target/release/json2parquet \
        -c snappy \
        California.jsonl \
        California.snappy.pq

The above took 43.8 seconds to convert the JSONL into Parquet and produced an 815 MB file containing 12 row groups.

In [1]: import pyarrow.parquet as pq

In [2]: pf = pq.ParquetFile('California.snappy.pq')

In [3]: pf.schema
Out[3]: 
<pyarrow._parquet.ParquetSchema object at 0x109a11380>
required group field_id=-1 arrow_schema {
  optional binary field_id=-1 capture_dates_range (String);
  optional binary field_id=-1 geom (String);
  optional int64 field_id=-1 release;
}

In [4]: pf.metadata
Out[4]: 
<pyarrow._parquet.FileMetaData object at 0x10adf09f0>
  created_by: parquet-rs version 29.0.0
  num_columns: 3
  num_rows: 11542912
  num_row_groups: 12
  format_version: 1.0
  serialized_size: 7969

The ClickHouse-produced PQ file has 306 row groups.

In [1]: pf = pq.ParquetFile('cali.snappy.pq')

In [2]: pf.schema
Out[2]: 
<pyarrow._parquet.ParquetSchema object at 0x105ccc940>
required group field_id=-1 schema {
  optional int64 field_id=-1 release;
  optional binary field_id=-1 capture_dates_range;
  optional binary field_id=-1 geom;
}

In [3]: pf.metadata
Out[3]: 
<pyarrow._parquet.FileMetaData object at 0x1076705e0>
  created_by: parquet-cpp version 1.5.1-SNAPSHOT
  num_columns: 3
  num_rows: 11542912
  num_row_groups: 306
  format_version: 1.0
  serialized_size: 228389

I'm not sure if the row group sizes played into the performance delta.
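(For reference, the writer's row-group size in parquet-rs is configurable through WriterProperties, and 12 row groups over ~11.5M rows appears consistent with the crate's default of 1024 * 1024 rows per group. A minimal sketch of that knob, using a hypothetical write_parquet helper rather than anything json2parquet exposes:)

use std::fs::File;
use std::sync::Arc;

use arrow::datatypes::Schema;
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::basic::Compression;
use parquet::file::properties::WriterProperties;

// Hypothetical helper: write `batches` as Snappy-compressed Parquet,
// capping each row group at `rows_per_group` rows.
fn write_parquet(
    path: &str,
    schema: Arc<Schema>,
    batches: &[RecordBatch],
    rows_per_group: usize,
) -> Result<(), Box<dyn std::error::Error>> {
    let props = WriterProperties::builder()
        .set_compression(Compression::SNAPPY)
        .set_max_row_group_size(rows_per_group)
        .build();
    let mut writer = ArrowWriter::try_new(File::create(path)?, schema, Some(props))?;
    for batch in batches {
        writer.write(batch)?;
    }
    writer.close()?;
    Ok(())
}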

Is there anything I can do to my compilation settings to speed up Parquet generation?

I checked with the author of json2parquet and he's certain the issue isn't within his code: domoritz/json2parquet#116


tustvold commented Jan 4, 2023

Is the bottleneck writing the Parquet file or reading the JSON data? The JSON reader has not been heavily optimised yet. Perhaps you could use cargo-flamegraph or similar to confirm where the time is being spent?


marklit commented Jan 4, 2023

When running a 10-line extract of the JSONL file through both tools, I noticed that json2parquet called write 434 times (61% of its syscall time). read accounted for only 2% of the time, which suggests to me that most of the time is spent writing Parquet rather than reading JSONL.

$ sudo strace -wc \
    target/release/json2parquet \
    -c snappy \
    ../cali10.jsonl \
    test.snappy.pq
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 61.30    0.007141          16       434           write
 21.66    0.002523         280         9           openat
  4.33    0.000504          19        26           mmap
  2.73    0.000318         317         1           execve
  2.09    0.000244          16        15           read
...

For comparison, ClickHouse only called write twice with this workload.

$ sudo strace -wc \
    clickhouse local \
          --input-format JSONEachRow \
          -q "SELECT *
              FROM table
              FORMAT Parquet" \
    < ../cali10.jsonl \
    > cali.snappy.pq
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 29.52    0.019018        1584        12           futex
 21.15    0.013625          63       214           gettid
 11.19    0.007209         514        14           mprotect
 11.06    0.007123         791         9         4 stat
  8.72    0.005617         108        52           close
  5.16    0.003327        1109         3           poll
  2.19    0.001412          23        60           mmap
  2.09    0.001344          39        34         1 openat
  1.27    0.000816          18        44           read
...
  0.15    0.000098          48         2           write
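(Purely as an illustration of what those write counts measure: the number of write(2) calls mostly reflects how the output is buffered rather than how much work is done. A minimal, generic sketch of unbuffered versus buffered output in Rust, not a claim about how json2parquet or parquet-rs writes:)

use std::fs::File;
use std::io::{BufWriter, Write};

fn main() -> std::io::Result<()> {
    // Unbuffered: every write_all() below becomes its own write(2) syscall.
    let mut raw = File::create("unbuffered.out")?;
    for i in 0..1000 {
        raw.write_all(format!("{i}\n").as_bytes())?;
    }

    // Buffered: writes are coalesced in an in-memory buffer (8 KiB by default)
    // and write(2) is only issued when the buffer fills or on flush/drop.
    let mut buffered = BufWriter::new(File::create("buffered.out")?);
    for i in 0..1000 {
        buffered.write_all(format!("{i}\n").as_bytes())?;
    }
    buffered.flush()?;
    Ok(())
}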


Dandandan commented Jan 4, 2023

Could you try recompiling with arrow 30.0 to see if it has improved?


marklit commented Jan 4, 2023

Certainly. I'll report back with my findings.


tustvold commented Jan 4, 2023

I could be mistaken, but strace only shows syscall latency, not any time spent doing CPU-bound work. perf may give a more accurate picture; perhaps something like perf record --call-graph dwarf -- <COMMAND>?

I would expect this task not to be all that IO-bound on modern hardware.


marklit commented Jan 4, 2023

The following was run on version 30.0.0.

$ tail -n6 ~/json2parquet/Cargo.toml
[dependencies]
parquet = "30.0.0"
arrow = "30.0.0"
arrow-schema = { version = "30.0.0", features = ["serde"] }
serde_json = "1.0.91"
clap = { version = "4.0.32", features = ["derive"] }

I ran this comparison again on a fresh 16-core e2-highcpu-16 VM on GCP. ClickHouse took 13.7 seconds and json2parquet took 56.7 seconds to process the 11,542,912-row JSONL file. I noticed json2parquet was maxing out a single core while ClickHouse managed to hit 40-180% across 4 cores according to htop.

Below is the flamegraph of json2parquet processing the 11,542,912-row JSONL file.

$ echo -1 | sudo tee /proc/sys/kernel/perf_event_paranoid

$ git clone https://github.com/brendangregg/FlameGraph ~/FlameGraph

$ cd ~/json2parquet

$ RUSTFLAGS='-Ctarget-cpu=native' \
        cargo build --release

$ sudo perf record \
    --call-graph dwarf \
    -- \
    target/release/json2parquet \
    -c snappy \
    ../California.jsonl \
    test.snappy.pq

$ sudo perf script \
    | ~/FlameGraph/stackcollapse-perf.pl \
    > out.perf-folded
$ ~/FlameGraph/flamegraph.pl \
    out.perf-folded \
    > perf.svg

[flamegraph: perf.svg]


tustvold commented Jan 4, 2023

Thank you. I have some ideas for how to improve the performance of JSON reading; I'll write up a ticket over the coming days.

In the short term you may be able to reduce the cost of infer_json_schema by limiting the number of records it reads, or even by providing the schema up-front.
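A minimal sketch of supplying the schema up front so the reader skips inference entirely; the field types come from the pyarrow schema dump above, and the builder calls follow the newer arrow-json API (the arrow 30 ReaderBuilder is configured differently, so treat the exact method names as an assumption):

use std::fs::File;
use std::io::BufReader;
use std::sync::Arc;

use arrow::datatypes::{DataType, Field, Schema};
use arrow::json::ReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Schema written out by hand, so the reader never scans the file to infer it.
    let schema = Arc::new(Schema::new(vec![
        Field::new("release", DataType::Int64, true),
        Field::new("capture_dates_range", DataType::Utf8, true),
        Field::new("geom", DataType::Utf8, true),
    ]));

    let input = BufReader::new(File::open("California.jsonl")?);
    let reader = ReaderBuilder::new(schema)
        .with_batch_size(8192)
        .build(input)?;

    for batch in reader {
        let batch = batch?;
        // Each RecordBatch would be handed to the Parquet writer here.
        let _ = batch.num_rows();
    }
    Ok(())
}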

Dandandan commented Jan 4, 2023

For faster parsing of JSON, one improvement might be to avoid allocations when generating a Value. simd-json has some options to do this (and also, AFAIK, to allow for buffer reuse): https://docs.rs/simd-json/latest/simd_json/value/borrowed/index.html
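A minimal sketch of the borrowed-value API, illustrative only (the record mirrors the dataset above):

use simd_json::value::borrowed::Value;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // simd-json parses in place: string fields of the resulting borrowed Value
    // point back into `buf` rather than into newly allocated Strings, which is
    // why the input buffer has to be mutable.
    let mut buf = br#"{"release":1,"capture_dates_range":"","geom":"{}"}"#.to_vec();
    let parsed: Value = simd_json::to_borrowed_value(&mut buf)?;

    if let Value::Object(obj) = &parsed {
        println!("release = {:?}", obj.get("release"));
    }
    Ok(())
}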


tustvold commented Jan 4, 2023

Yeah, the reader is completely dominated by memory allocation; it should be possible to eliminate this just by using a custom Value. Switching to simd-json is also an option, but is less of an invisible change 😄

Dandandan commented Jan 4, 2023

> Yeah, the reader is completely dominated by memory allocation; it should be possible to eliminate this just by using a custom Value. Switching to simd-json is also an option, but is less of an invisible change 😄

Sounds like a good idea to start with serde 😜
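(For reference, serde itself can also avoid most per-field allocations via borrowed deserialization into Cow fields; a minimal illustrative sketch, not a description of what arrow-json currently does:)

use std::borrow::Cow;

use serde::Deserialize;

// Borrowing deserialization: Cow fields point into the input line when no
// escape sequences are present, and only allocate when unescaping is needed.
#[derive(Debug, Deserialize)]
struct Row<'a> {
    release: i64,
    #[serde(borrow)]
    capture_dates_range: Cow<'a, str>,
    #[serde(borrow)]
    geom: Cow<'a, str>,
}

fn main() -> serde_json::Result<()> {
    let line = r#"{"release":1,"capture_dates_range":"","geom":"{}"}"#;
    let row: Row = serde_json::from_str(line)?;
    println!("{row:?}");
    Ok(())
}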

tustvold changed the title from "Performance improvement for compressing Snappy-compressed Parquet files?" to "Improve Performance of JSON Reader" Jan 6, 2023
tustvold self-assigned this Jan 6, 2023
tustvold added the enhancement label and removed the question label Jan 6, 2023

alamb commented Jan 25, 2023

BTW -- #3479 is showing significant promise 🚀

tustvold added the arrow label Jan 30, 2023