This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

parquet_read panics when working with date64s #1400

Closed
kjschiroo opened this issue Feb 14, 2023 · 3 comments · Fixed by #1402
Labels
bug Something isn't working

Comments

@kjschiroo
Contributor

The parquet_read example panics when reading the file generated by the following snippet of python:

import datetime

import pyarrow as pa
import pyarrow.parquet

print(f"pyarrow {pa.__version__}")

table = pa.Table.from_pydict(
    {
        "my_column": pa.array(
            [datetime.date(2022, 6, 28)],
            pa.date64()
        )
    }
)
with open("sample.parquet", "wb") as f:
    pa.parquet.write_table(
        table=table,
        where=f,
        version="2.6",
        data_page_version="2.0",
        compression="SNAPPY",
    )

Generating the file:

> python3 minimal.py
pyarrow 11.0.0

Running parquet_read the first issue I run into appears to originate from reading statistics:

> RUST_BACKTRACE=1 cargo run --release --features io_parquet,io_parquet_compression --example parquet_read sample.parquet 
    Finished release [optimized] target(s) in 0.16s
     Running `target/release/examples/parquet_read sample.parquet`
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', src/io/parquet/read/statistics/primitive.rs:50:29
stack backtrace:
   0: rust_begin_unwind
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/core/src/panicking.rs:64:14
   2: core::panicking::panic
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/core/src/panicking.rs:111:5
   3: arrow2::io::parquet::read::statistics::primitive::push
   4: arrow2::io::parquet::read::statistics::push
   5: arrow2::io::parquet::read::statistics::deserialize
   6: parquet_read::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

If I comment out the statistics read (lines 24-27 of parquet_read.rs) I get:

> RUST_BACKTRACE=1 cargo run --release --features io_parquet,io_parquet_compression --example parquet_read sample.parquet
   Compiling arrow2 v0.16.0 (/home/kjschiroo/Desktop/arrow2)
    Finished release [optimized] target(s) in 41.45s
     Running `target/release/examples/parquet_read sample.parquet`
thread 'main' panicked at 'index out of bounds: the len is 0 but the index is 0', src/io/parquet/read/deserialize/primitive/basic.rs:229:40
stack backtrace:
   0: rust_begin_unwind
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/core/src/panicking.rs:64:14
   2: core::panicking::panic_bounds_check
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/core/src/panicking.rs:147:5
   3: arrow2::io::parquet::read::deserialize::utils::extend_from_decoder
   4: <arrow2::io::parquet::read::deserialize::primitive::basic::PrimitiveDecoder<T,P,F> as arrow2::io::parquet::read::deserialize::utils::Decoder>::extend_from_state
   5: arrow2::io::parquet::read::deserialize::utils::extend_from_new_page
   6: arrow2::io::parquet::read::deserialize::utils::next
   7: <arrow2::io::parquet::read::deserialize::primitive::integer::IntegerIter<T,I,P,F> as core::iter::traits::iterator::Iterator>::next
   8: <arrow2::io::parquet::read::deserialize::primitive::integer::IntegerIter<T,I,P,F> as core::iter::traits::iterator::Iterator>::next
   9: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::next
  10: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::next
  11: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::next
  12: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter
  13: core::iter::adapters::try_process
  14: <arrow2::io::parquet::read::row_group::RowGroupDeserializer as core::iter::traits::iterator::Iterator>::next
  15: <arrow2::io::parquet::read::file::FileReader<R> as core::iter::traits::iterator::Iterator>::next
  16: <arrow2::io::parquet::read::file::FileReader<R> as core::iter::traits::iterator::Iterator>::next
  17: parquet_read::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

That's the error I'd originally stumbled upon. Any thoughts on what might be up?

@kjschiroo
Contributor Author

@jorgecarleitao Thanks for responding to this so quickly! I'd noticed your PR said that writing date64 to parquet is implementation-defined, which I hadn't been aware of. Is there a source you could point me towards so I can better understand how much interoperability I should expect between parquet files created and consumed by different libraries?

@jorgecarleitao
Owner

In general, interoperability is high. The main exceptions are data types whose representation in one format (e.g. arrow) is not uniquely represented in another (e.g. parquet). In those cases, there is a tradeoff that libraries have to make.

In the case of date64, parquet supports dates as 32-bit integers. Arrow libraries must decide whether they write date64 as 32-bit parquet dates or as 64-bit parquet integers - this choice is implementation-defined.

Since date64 in Arrow is kind of useless (every value must be a multiple of 86400000 anyway), sticking to parquet int32 is likely best. Alternatively, avoiding arrow date64 altogether results in the highest possible compatibility.
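The tradeoff above can be sketched with plain integers (an illustrative sketch, not arrow2's actual code; the constants are assumptions based on the thread):

```python
# Illustrative sketch of the two implementation-defined parquet
# encodings for an Arrow date64 value.
MS_PER_DAY = 86_400_000

# An Arrow date64 cell holds milliseconds since the Unix epoch,
# so every valid value is a multiple of MS_PER_DAY.
date64_ms = 19_171 * MS_PER_DAY  # 2022-06-28

# Option A (what pyarrow does): narrow to a parquet DATE (int32 days).
as_date32_days = date64_ms // MS_PER_DAY

# Option B: store the raw 64-bit millisecond count as a parquet int64.
as_int64_ms = date64_ms

# Both encodings round-trip to the same calendar date; a reader just
# has to know which convention the writer chose.
assert as_date32_days * MS_PER_DAY == as_int64_ms
print(as_date32_days, as_int64_ms)
```

A file written with option A and read by a library expecting option B (or vice versa) is exactly where interoperability breaks down.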

The reference for pyarrow is here, where it says

(3) On the write side, an Arrow Date64 is also mapped to a Parquet DATE INT32.

I hope this helps :)

@kjschiroo
Contributor Author

Thanks! That's exactly what I was looking for! I didn't realize that date64 was in milliseconds since the epoch. I'd just assumed it must have been days.
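A quick stdlib-only sanity check of that encoding (an illustrative sketch, not pyarrow's internals):

```python
import datetime

# Arrow's date64 unit is milliseconds since the Unix epoch (not days):
# a calendar date maps to days * 86_400_000.
MS_PER_DAY = 86_400_000
epoch = datetime.date(1970, 1, 1)
d = datetime.date(2022, 6, 28)  # the date from the reproduction above

days_since_epoch = (d - epoch).days             # what a date32 stores
ms_since_epoch = days_since_epoch * MS_PER_DAY  # what a date64 stores

assert ms_since_epoch % MS_PER_DAY == 0
print(days_since_epoch, ms_since_epoch)
```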
