This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

parquet_read panics when working with date64s #1400

Closed
kjschiroo opened this issue Feb 14, 2023 · 3 comments · Fixed by #1402
Labels
bug Something isn't working

Comments

@kjschiroo
Contributor

The parquet_read example panics when reading the file generated by the following snippet of python:

import datetime

import pyarrow as pa
import pyarrow.parquet

print(f"pyarrow {pa.__version__}")

table = pa.Table.from_pydict(
    {
        "my_column": pa.array(
            [datetime.date(2022, 6, 28)],
            pa.date64()
        )
    }
)
with open("sample.parquet", "wb") as f:
    pa.parquet.write_table(
        table=table,
        where=f,
        version="2.6",
        data_page_version="2.0",
        compression="SNAPPY",
    )

Generating the file:

> python3 minimal.py
pyarrow 11.0.0

Running parquet_read the first issue I run into appears to originate from reading statistics:

> RUST_BACKTRACE=1 cargo run --release --features io_parquet,io_parquet_compression --example parquet_read sample.parquet 
    Finished release [optimized] target(s) in 0.16s
     Running `target/release/examples/parquet_read sample.parquet`
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', src/io/parquet/read/statistics/primitive.rs:50:29
stack backtrace:
   0: rust_begin_unwind
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/core/src/panicking.rs:64:14
   2: core::panicking::panic
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/core/src/panicking.rs:111:5
   3: arrow2::io::parquet::read::statistics::primitive::push
   4: arrow2::io::parquet::read::statistics::push
   5: arrow2::io::parquet::read::statistics::deserialize
   6: parquet_read::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

If I comment out the statistics read (lines 24-27 of parquet_read.rs) I get:

> RUST_BACKTRACE=1 cargo run --release --features io_parquet,io_parquet_compression --example parquet_read sample.parquet
   Compiling arrow2 v0.16.0 (/home/kjschiroo/Desktop/arrow2)
    Finished release [optimized] target(s) in 41.45s
     Running `target/release/examples/parquet_read sample.parquet`
thread 'main' panicked at 'index out of bounds: the len is 0 but the index is 0', src/io/parquet/read/deserialize/primitive/basic.rs:229:40
stack backtrace:
   0: rust_begin_unwind
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/core/src/panicking.rs:64:14
   2: core::panicking::panic_bounds_check
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/core/src/panicking.rs:147:5
   3: arrow2::io::parquet::read::deserialize::utils::extend_from_decoder
   4: <arrow2::io::parquet::read::deserialize::primitive::basic::PrimitiveDecoder<T,P,F> as arrow2::io::parquet::read::deserialize::utils::Decoder>::extend_from_state
   5: arrow2::io::parquet::read::deserialize::utils::extend_from_new_page
   6: arrow2::io::parquet::read::deserialize::utils::next
   7: <arrow2::io::parquet::read::deserialize::primitive::integer::IntegerIter<T,I,P,F> as core::iter::traits::iterator::Iterator>::next
   8: <arrow2::io::parquet::read::deserialize::primitive::integer::IntegerIter<T,I,P,F> as core::iter::traits::iterator::Iterator>::next
   9: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::next
  10: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::next
  11: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::next
  12: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter
  13: core::iter::adapters::try_process
  14: <arrow2::io::parquet::read::row_group::RowGroupDeserializer as core::iter::traits::iterator::Iterator>::next
  15: <arrow2::io::parquet::read::file::FileReader<R> as core::iter::traits::iterator::Iterator>::next
  16: <arrow2::io::parquet::read::file::FileReader<R> as core::iter::traits::iterator::Iterator>::next
  17: parquet_read::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

That's the error I'd originally stumbled upon. Any thoughts on what might be up?

@kjschiroo
Contributor Author

@jorgecarleitao Thanks for responding to this so quickly! I'd noticed your PR said that writing date64 to parquet is implementation-defined, which I hadn't been aware of. Is there a source you could point me towards so I can better understand how much interoperability I should expect between parquet files created and consumed by different libraries?

@jorgecarleitao
Owner

In general, interoperability is high. The main exceptions are data types whose representation in one format (e.g. arrow) is not uniquely represented in another (e.g. parquet). In those cases, there is a tradeoff that libraries have to make.

In the case of date64, parquet supports dates as 32-bit integers. Arrow libraries must decide whether they write date64 as 32-bit parquet dates or as 64-bit parquet integers - this choice is implementation-defined.

Since date64 in Arrow is kind of useless (every value must be a multiple of 86400000 anyway), sticking to parquet int32 is likely best. Alternatively, avoiding arrow date64 altogether results in the highest possible compatibility.
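The tradeoff above can be sketched with plain integers (an illustrative sketch, not arrow2's actual code; the constants are assumptions based on the thread):

```python
# Illustrative sketch of the two implementation-defined parquet
# encodings for an Arrow date64 value.
MS_PER_DAY = 86_400_000

# An Arrow date64 cell holds milliseconds since the Unix epoch,
# so every valid value is a multiple of MS_PER_DAY.
date64_ms = 19_171 * MS_PER_DAY  # 2022-06-28

# Option A (what pyarrow does): narrow to a parquet DATE (int32 days).
as_date32_days = date64_ms // MS_PER_DAY

# Option B: store the raw 64-bit millisecond count as a parquet int64.
as_int64_ms = date64_ms

# Both encodings round-trip to the same calendar date; a reader just
# has to know which convention the writer chose.
assert as_date32_days * MS_PER_DAY == as_int64_ms
print(as_date32_days, as_int64_ms)
```

A file written with option A and read by a library expecting option B (or vice versa) is exactly where interoperability breaks down.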

The reference for pyarrow is here, where it says

(3) On the write side, an Arrow Date64 is also mapped to a Parquet DATE INT32.

I hope this helps :)

@kjschiroo
Contributor Author

Thanks! That's exactly what I was looking for! I didn't realize that date64 was in milliseconds since the epoch. I'd just assumed it must have been days.
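A quick stdlib-only sanity check of that encoding (an illustrative sketch, not pyarrow's internals):

```python
import datetime

# Arrow's date64 unit is milliseconds since the Unix epoch (not days):
# a calendar date maps to days * 86_400_000.
MS_PER_DAY = 86_400_000
epoch = datetime.date(1970, 1, 1)
d = datetime.date(2022, 6, 28)  # the date from the reproduction above

days_since_epoch = (d - epoch).days             # what a date32 stores
ms_since_epoch = days_since_epoch * MS_PER_DAY  # what a date64 stores

assert ms_since_epoch % MS_PER_DAY == 0
print(days_since_epoch, ms_since_epoch)
```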
