Reading large metadata-only _metadata file much slower than PyArrow #142

Closed
kylebarron opened this issue May 18, 2022 · 7 comments

@kylebarron
Contributor

kylebarron commented May 18, 2022

👋

I'm working with some large partitioned Parquet datasets that have a top-level _metadata file that contains the FileMetaData for every row group in every Parquet file in the directory. This _metadata file can have up to 30,000 row groups. In my experience, parsing these files with parquet2::read::read_metadata can be up to 70x slower than with pyarrow.parquet.read_metadata.

Python:

In [1]: import pyarrow.parquet as pq

In [2]: %timeit pq.read_metadata('_metadata')
20.1 ms ± 762 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Arrow2:

use std::{fs::File, time::Instant};

use parquet2::read::read_metadata;

fn main() {
    let mut file = File::open("_metadata").unwrap();

    let now = Instant::now();
    let meta = read_metadata(&mut file).unwrap();
    println!("Time to parse metadata: {}", now.elapsed().as_secs_f32());
}
> cargo run
Time to parse metadata: 1.465529

Anecdotally, for an internal _metadata file with 30,000 row groups, parsing took ~11 s in arrow2 and ~160 ms in pyarrow. (Though while putting this repro together, I learned that pyarrow.parquet.write_metadata is O(n^2) 😬, so I didn't create a full 30,000-row-group setup for this example.)

I haven't looked at the code for read_metadata yet; do you have any ideas where this might be slower than with pyarrow?

Repro:

from io import BytesIO

import pyarrow as pa
import pyarrow.parquet as pq


def create_example_file_meta_data():
    data = {
        "str": pa.array(["a", "b", "c", "d"], type=pa.string()),
        "uint8": pa.array([1, 2, 3, 4], type=pa.uint8()),
        "int32": pa.array([0, -2147483638, 2147483637, 1], type=pa.int32()),
        "bool": pa.array([True, True, False, False], type=pa.bool_()),
    }
    table = pa.table(data)
    metadata_collector = []
    pq.write_table(table, BytesIO(), metadata_collector=metadata_collector)
    return table.schema, metadata_collector[0]

def main():
    schema, meta = create_example_file_meta_data()
    print('created collector')
    metadata_collector = [meta] * 5_000
    print('writing meta')
    pq.write_metadata(schema, '_metadata', metadata_collector=metadata_collector)

if __name__ == '__main__':
    main()
@jorgecarleitao
Owner

Could you try this?

let mut file = BufReader::new(File::open("_metadata").unwrap());

Essentially, if there is no buffering, reading the metadata requires one syscall per thrift read, which is expensive.
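
If the footer is very large, it may also help to size the buffer explicitly. A minimal sketch, assuming the default BufReader capacity (currently 8 KiB) is small relative to a multi-megabyte footer; the 1 MiB value is only illustrative:

use std::{fs::File, io::BufReader};

use parquet2::read::read_metadata;

fn main() {
    // Buffer reads so thrift decoding hits memory instead of issuing one
    // syscall per small read; the 1 MiB capacity here is an illustrative guess.
    let mut reader = BufReader::with_capacity(1 << 20, File::open("_metadata").unwrap());
    let _meta = read_metadata(&mut reader).unwrap();
}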

@kylebarron
Contributor Author

Ah, I knew I had to be missing something. That said, even after updating the implementation to use BufReader, it's still ~20x slower than pyarrow: 400 ms vs 20 ms.

use std::{fs::File, time::Instant, io::BufReader};

use parquet2::read::read_metadata;

fn main() {
    let mut file = BufReader::new(File::open("_metadata").unwrap());

    let now = Instant::now();
    let meta = read_metadata(&mut file).unwrap();
    println!("Time to parse metadata: {}", now.elapsed().as_secs_f32());
}

Time to parse metadata: 0.4077697

@jorgecarleitao
Owner

Thanks for the bench! I opened #143 with a potential fix. Do you have the option to run your benchmark against that PR, just to check whether it indeed improves things?

@kylebarron
Contributor Author

That was fast! Unfortunately this branch seems to be a little slower, with runs hovering at about 490 ms on my machine.

> cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.07s
     Running `target/debug/parquet-metadata-demo`
Time to parse metadata: 0.48819956

Here's the code and Cargo lockfile in a gist.

And here's the _metadata file itself: _metadata.zip

@jorgecarleitao
Owner

Could you try cargo run --release? Otherwise you will only get debug-build speed :)

@kylebarron
Contributor Author

🤦‍♂️ Sorry for the noise! I had no idea there were such significant differences between debug and release builds (there's a short profile sketch after the numbers below). This is pretty close to the pyarrow benchmark, so I'll close this.

With #143:

> cargo run --release
    Finished release [optimized] target(s) in 0.09s
     Running `target/release/parquet-metadata-demo`
Time to parse metadata: 0.06615424

On master:

> cargo run --release
   Compiling parquet2 v0.12.1
   Compiling parquet-metadata-demo v0.1.0 (/Users/kbarron/tmp/parquet-metadata-demo)
    Finished release [optimized] target(s) in 26.91s
     Running `target/release/parquet-metadata-demo`
Time to parse metadata: 0.062655576
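
For anyone else landing here, the gap comes down to the optimization level of the two Cargo profiles. A minimal sketch of the settings being contrasted, with values I believe are the current Cargo defaults:

# Cargo.toml
[profile.dev]      # used by plain `cargo run`
opt-level = 0      # no optimizations, fastest compiles

[profile.release]  # used by `cargo run --release`
opt-level = 3      # full optimizations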

@jorgecarleitao
Owner

Thanks a lot for reporting back, and no worries. For reference, the main work here is the thrift deserialization, which is CPU-bound and thus benefits heavily from compiler optimizations. That is likely the primary factor in the performance here.
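
A minimal sketch of how that CPU cost could be isolated from file I/O, assuming the footer is first read fully into memory (std::io::Cursor satisfies the Read + Seek bound that read_metadata needs):

use std::io::Cursor;
use std::time::Instant;

use parquet2::read::read_metadata;

fn main() {
    // Load the whole _metadata file up front so the timed section below
    // measures only the thrift deserialization, not disk reads.
    let bytes = std::fs::read("_metadata").unwrap();
    let mut cursor = Cursor::new(bytes);

    let now = Instant::now();
    let _meta = read_metadata(&mut cursor).unwrap();
    println!("Thrift decode only: {} s", now.elapsed().as_secs_f32());
}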

@jorgecarleitao added the question and no-changelog labels on May 19, 2022