Reading large metadata-only `_metadata` file much slower than PyArrow #142

👋

I'm working with some large partitioned Parquet datasets that have a top-level `_metadata` file containing the `FileMetaData` for every row group in every Parquet file in the directory. This `_metadata` file can have up to 30,000 row groups. In my experience, parsing these files with `parquet2::read::read_metadata` can be up to 70x slower than with `pyarrow.parquet.read_metadata`.

Python:

Arrow2:

Anecdotally, for an internal `_metadata` file with 30,000 row groups, it was taking ~11s to parse in `arrow2` and ~160ms to parse in pyarrow. (Though in the making of this repro example, I learned that `pyarrow.parquet.write_metadata` is `O(n^2)` 😬, so I didn't create a full 30,000-row-group setup for this example.)

I haven't looked at the code for `read_metadata` yet; do you have any ideas where this might be slower than with `pyarrow`?

Repro:
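For context, a minimal sketch of the kind of unbuffered `parquet2` benchmark being compared here — the `_metadata` path and the timing scaffold are assumptions, and the pyarrow side presumably amounts to timing `pyarrow.parquet.read_metadata("_metadata")`:

```rust
use std::{fs::File, time::Instant};
use parquet2::read::read_metadata;

fn main() {
    // Read straight from the File with no buffering: every small thrift
    // read becomes its own syscall, which is what the first reply below
    // identifies as the bottleneck.
    let mut file = File::open("_metadata").unwrap();
    let now = Instant::now();
    let _meta = read_metadata(&mut file).unwrap();
    println!("Time to parse metadata: {}", now.elapsed().as_secs_f32());
}
```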
Comments
Could you try a `BufReader`? Essentially, if there is no buffering, reading the metadata requires one syscall per thrift read, which is expensive.
Ah, I knew I had to be missing something. That said, even after updating the implementation to use a `BufReader`, it's still noticeably slower than pyarrow on my machine:

```rust
use std::{fs::File, io::BufReader, time::Instant};
use parquet2::read::read_metadata;

fn main() {
    // Buffer reads so thrift decoding doesn't hit the OS once per read.
    let mut file = BufReader::new(File::open("_metadata").unwrap());
    let now = Instant::now();
    let _meta = read_metadata(&mut file).unwrap();
    println!("Time to parse metadata: {}", now.elapsed().as_secs_f32());
}
```
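As an aside, if syscall overhead still dominates for very large `_metadata` files, the buffer size can also be raised. `BufReader::with_capacity` is standard-library API; the 1 MiB value below is an arbitrary illustrative choice, not something suggested in this thread:

```rust
use std::{fs::File, io::BufReader};
use parquet2::read::read_metadata;

fn main() {
    // A bigger buffer means fewer read syscalls while decoding thrift;
    // 1 MiB is an example value, not a tuned recommendation.
    let file = File::open("_metadata").unwrap();
    let mut reader = BufReader::with_capacity(1 << 20, file);
    let _meta = read_metadata(&mut reader).unwrap();
}
```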
Thanks for the bench! I PRed #143 with a potential fix. Do you have the option to run it against this PR, just to check whether it indeed improves things?
That was fast! Unfortunately this branch seems to be a little slower, with runs hovering at about 490ms on my machine.

Here's the code and Cargo lockfile in a gist. And here's the `_metadata` file.
Could you try building with `--release`?
🤦‍♂️ Sorry for the noise! I had no idea that there were such significant differences between debug and release builds.

With #143:

On master:
Thanks a lot for reporting back. No worries. For reference, the main work here is the thrift deserialization, which is CPU-bound and thus benefits heavily from compiler optimizations. That is likely the primary factor affecting performance here.