This repository has been archived by the owner on Feb 18, 2024. It is now read-only.
The `parquet_read` example panics when reading a file generated by a short snippet of Python. Running `parquet_read`, the first issue I run into appears to originate from reading statistics:
```
> RUST_BACKTRACE=1 cargo run --release --features io_parquet,io_parquet_compression --example parquet_read sample.parquet
    Finished release [optimized] target(s) in 0.16s
     Running `target/release/examples/parquet_read sample.parquet`
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', src/io/parquet/read/statistics/primitive.rs:50:29
stack backtrace:
   0: rust_begin_unwind
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/core/src/panicking.rs:64:14
   2: core::panicking::panic
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/core/src/panicking.rs:111:5
   3: arrow2::io::parquet::read::statistics::primitive::push
   4: arrow2::io::parquet::read::statistics::push
   5: arrow2::io::parquet::read::statistics::deserialize
   6: parquet_read::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
```
If I comment out the statistics read (lines 24-27 of `parquet_read.rs`) I get:
```
> RUST_BACKTRACE=1 cargo run --release --features io_parquet,io_parquet_compression --example parquet_read sample.parquet
   Compiling arrow2 v0.16.0 (/home/kjschiroo/Desktop/arrow2)
    Finished release [optimized] target(s) in 41.45s
     Running `target/release/examples/parquet_read sample.parquet`
thread 'main' panicked at 'index out of bounds: the len is 0 but the index is 0', src/io/parquet/read/deserialize/primitive/basic.rs:229:40
stack backtrace:
   0: rust_begin_unwind
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/core/src/panicking.rs:64:14
   2: core::panicking::panic_bounds_check
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/core/src/panicking.rs:147:5
   3: arrow2::io::parquet::read::deserialize::utils::extend_from_decoder
   4: <arrow2::io::parquet::read::deserialize::primitive::basic::PrimitiveDecoder<T,P,F> as arrow2::io::parquet::read::deserialize::utils::Decoder>::extend_from_state
   5: arrow2::io::parquet::read::deserialize::utils::extend_from_new_page
   6: arrow2::io::parquet::read::deserialize::utils::next
   7: <arrow2::io::parquet::read::deserialize::primitive::integer::IntegerIter<T,I,P,F> as core::iter::traits::iterator::Iterator>::next
   8: <arrow2::io::parquet::read::deserialize::primitive::integer::IntegerIter<T,I,P,F> as core::iter::traits::iterator::Iterator>::next
   9: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::next
  10: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::next
  11: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::next
  12: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter
  13: core::iter::adapters::try_process
  14: <arrow2::io::parquet::read::row_group::RowGroupDeserializer as core::iter::traits::iterator::Iterator>::next
  15: <arrow2::io::parquet::read::file::FileReader<R> as core::iter::traits::iterator::Iterator>::next
  16: <arrow2::io::parquet::read::file::FileReader<R> as core::iter::traits::iterator::Iterator>::next
  17: parquet_read::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
```
Which is the error I'd originally stumbled upon. Any thoughts on what might be up?
@jorgecarleitao Thanks for responding to this so quickly! I'd noticed your PR said that writing date64 to parquet is implementation-defined, which I hadn't been aware of. Is there any source you could point me towards so I can better understand the amount of interoperability I should expect between parquet files created and consumed by different libraries?
In general the interoperability is high. The main exceptions are data types whose representation in one format (e.g. arrow) is not uniquely represented in another (e.g. parquet). In those cases, there is a tradeoff that libraries have to make.
In the case of date64, parquet supports dates in 32 bits. Arrow libraries must decide whether to write date64 as 32-bit parquet dates or as 64-bit parquet integers - this choice is implementation-defined.
Since date64 in Arrow is kind of useless because every value must be a multiple of 86400000 anyway, sticking to parquet int32 is likely best. Alternatively, avoiding arrow date64 altogether results in the highest possible compatibility.
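To make the two representations concrete: Arrow's date64 stores milliseconds since the Unix epoch (so every valid value is a multiple of 86,400,000), while a 32-bit parquet date stores whole days since the epoch. A minimal sketch of the lossless conversion between the two, in plain Python with no Arrow dependency (function names are illustrative, not from any library):

```python
from datetime import date

MS_PER_DAY = 86_400_000  # 24 * 60 * 60 * 1000
EPOCH = date(1970, 1, 1)

def date64_to_date32(ms: int) -> int:
    """Convert an Arrow date64 value (milliseconds since epoch)
    to a 32-bit parquet date (days since epoch)."""
    assert ms % MS_PER_DAY == 0, "date64 values must encode whole days"
    return ms // MS_PER_DAY

def date32_to_date64(days: int) -> int:
    """The inverse: days since epoch back to milliseconds."""
    return days * MS_PER_DAY

# 2023-02-18 is 19406 days after the epoch:
d = (date(2023, 2, 18) - EPOCH).days
print(d)                                 # 19406
print(date32_to_date64(d))               # 1676678400000
print(date64_to_date32(1676678400000))   # 19406
```

This is why writing date64 as parquet int32 loses nothing: the extra 32 bits never carry information for well-formed values.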
Thanks! That's exactly what I was looking for! I didn't realize that date64 was in milliseconds since the epoch. I'd just assumed it must have been days.