Reading large metadata-only _metadata file much slower than PyArrow #142

Closed
kylebarron opened this issue May 18, 2022 · 7 comments

@kylebarron
Contributor

kylebarron commented May 18, 2022

👋

I'm working with some large partitioned Parquet datasets that have a top-level _metadata file that contains the FileMetaData for every row group in every Parquet file in the directory. This _metadata file can have up to 30,000 row groups. In my experience, parsing these files with parquet2::read::read_metadata can be up to 70x slower than with pyarrow.parquet.read_metadata.

Python:

In [1]: import pyarrow.parquet as pq

In [2]: %timeit pq.read_metadata('_metadata')
20.1 ms ± 762 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Arrow2:

use std::{fs::File, time::Instant};

use parquet2::read::read_metadata;

fn main() {
    let mut file = File::open("_metadata").unwrap();

    let now = Instant::now();
    let meta = read_metadata(&mut file).unwrap();
    println!("Time to parse metadata: {}", now.elapsed().as_secs_f32());
}
> cargo run
Time to parse metadata: 1.465529

Anecdotally, for an internal _metadata file with 30,000 row groups, parsing took ~11 s in arrow2 and ~160 ms in pyarrow. (Though while putting this repro together, I learned that pyarrow.parquet.write_metadata is O(n^2) 😬, so I didn't create a full 30,000-row-group setup for this example.)

I haven't looked at the code for read_metadata yet; do you have any ideas where this might be slower than with pyarrow?

Repro:

from io import BytesIO

import pyarrow as pa
import pyarrow.parquet as pq


def create_example_file_meta_data():
    data = {
        "str": pa.array(["a", "b", "c", "d"], type=pa.string()),
        "uint8": pa.array([1, 2, 3, 4], type=pa.uint8()),
        "int32": pa.array([0, -2147483638, 2147483637, 1], type=pa.int32()),
        "bool": pa.array([True, True, False, False], type=pa.bool_()),
    }
    table = pa.table(data)
    metadata_collector = []
    pq.write_table(table, BytesIO(), metadata_collector=metadata_collector)
    return table.schema, metadata_collector[0]

def main():
    schema, meta = create_example_file_meta_data()
    print('created collector')
    metadata_collector = [meta] * 5_000
    print('writing meta')
    pq.write_metadata(schema, '_metadata', metadata_collector=metadata_collector)

if __name__ == '__main__':
    main()
@jorgecarleitao
Owner

Could you try this?

let mut file = BufReader::new(File::open("_metadata").unwrap());

Essentially, if there is no buffering, reading the metadata requires one syscall per thrift read, which is expensive.
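
If the footer is very large, it may also help to size the buffer explicitly. A minimal sketch, assuming the default BufReader capacity (currently 8 KiB) is small relative to a multi-megabyte footer; the 1 MiB value is only illustrative:

use std::{fs::File, io::BufReader};

use parquet2::read::read_metadata;

fn main() {
    // Buffer reads so thrift decoding hits memory instead of issuing one
    // syscall per small read; the 1 MiB capacity here is an illustrative guess.
    let mut reader = BufReader::with_capacity(1 << 20, File::open("_metadata").unwrap());
    let _meta = read_metadata(&mut reader).unwrap();
}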

@kylebarron
Contributor Author

Ah, I knew I had to be missing something. That said, even after updating the implementation to use BufReader, it's still ~20x slower than pyarrow: 400 ms vs 20 ms.

use std::{fs::File, time::Instant, io::BufReader};

use parquet2::read::read_metadata;

fn main() {
    let mut file = BufReader::new(File::open("_metadata").unwrap());

    let now = Instant::now();
    let meta = read_metadata(&mut file).unwrap();
    println!("Time to parse metadata: {}", now.elapsed().as_secs_f32());
}

Time to parse metadata: 0.4077697

@jorgecarleitao
Owner

Thanks for the bench! I opened #143 with a potential fix. Do you have the option to run your benchmark against that PR, just to check whether it indeed improves things?

@kylebarron
Contributor Author

That was fast! Unfortunately this branch seems to be a little slower, with runs hovering at about 490 ms on my machine.

> cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.07s
     Running `target/debug/parquet-metadata-demo`
Time to parse metadata: 0.48819956

Here's the code and Cargo lockfile in a gist.

And here's the _metadata file itself: _metadata.zip

@jorgecarleitao
Owner

Could you try cargo run --release? Otherwise you will only get debug-build speed :)

@kylebarron
Contributor Author

🤦‍♂️ Sorry for the noise! I had no idea there were such significant differences between debug and release builds (there's a short profile sketch after the numbers below). This is pretty close to the pyarrow benchmark, so I'll close this.

With #143:

> cargo run --release
    Finished release [optimized] target(s) in 0.09s
     Running `target/release/parquet-metadata-demo`
Time to parse metadata: 0.06615424

On master:

> cargo run --release
   Compiling parquet2 v0.12.1
   Compiling parquet-metadata-demo v0.1.0 (/Users/kbarron/tmp/parquet-metadata-demo)
    Finished release [optimized] target(s) in 26.91s
     Running `target/release/parquet-metadata-demo`
Time to parse metadata: 0.062655576
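
For anyone else landing here, the gap comes down to the optimization level of the two Cargo profiles. A minimal sketch of the settings being contrasted, with values I believe are the current Cargo defaults:

# Cargo.toml
[profile.dev]      # used by plain `cargo run`
opt-level = 0      # no optimizations, fastest compiles

[profile.release]  # used by `cargo run --release`
opt-level = 3      # full optimizations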

@jorgecarleitao
Owner

Thanks a lot for reporting back, and no worries. For reference, the main work here is the thrift deserialization, which is CPU-bound and thus benefits heavily from compiler optimizations. That is likely the primary factor in the performance here.
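
A minimal sketch of how that CPU cost could be isolated from file I/O, assuming the footer is first read fully into memory (std::io::Cursor satisfies the Read + Seek bound that read_metadata needs):

use std::io::Cursor;
use std::time::Instant;

use parquet2::read::read_metadata;

fn main() {
    // Load the whole _metadata file up front so the timed section below
    // measures only the thrift deserialization, not disk reads.
    let bytes = std::fs::read("_metadata").unwrap();
    let mut cursor = Cursor::new(bytes);

    let now = Instant::now();
    let _meta = read_metadata(&mut cursor).unwrap();
    println!("Thrift decode only: {} s", now.elapsed().as_secs_f32());
}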

@jorgecarleitao added the question and no-changelog labels on May 19, 2022