
Cannot read Parquet file with categorical type written with to_parquet #2009

Closed
mhconradt opened this issue Dec 7, 2021 · 5 comments

Comments

@mhconradt
Contributor

Are you using Python or Rust?

Python

Which feature gates did you use?

N/A

What version of polars are you using?

0.10.26

What operating system are you using polars on?

macOS Big Sur

Describe your bug.

I was working with some trade data containing a categorical column (the market, e.g. "BTC-PERP"), wrote the output of some queries to Parquet files without setting use_pyarrow=True, and was then unable to read the files back.
Setting use_pyarrow=True, casting the categorical column to Utf8, or removing the categorical data type altogether each suppresses the issue.

DataFrames containing some categorical types cannot be read after being written to Parquet with the Rust engine (the default; it would be nice if use_pyarrow defaulted to True).

What are the steps to reproduce the behavior?

Here's a gist containing a reproduction and some things I tried

What is the actual behavior?

Reading the file raises the following:
OSError: Invalid: Output buffer size (28) must be 30 or larger.

What is the expected behavior?

I should be able to read the file regardless of engine / data type.

@ritchie46
Member

ritchie46 commented Dec 8, 2021

I could reproduce this when reading with arrow as well:

shape: (5, 2)
┌──────┬─────────┐
│ cats ┆ numbers │
│ ---  ┆ ---     │
│ cat  ┆ i64     │
╞══════╪═════════╡
│ "AA" ┆ 0       │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ "AB" ┆ 1       │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ "AC" ┆ 2       │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ "AD" ┆ 3       │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ "AE" ┆ 4       │
└──────┴─────────┘

@jorgecarleitao have you got an idea what this is? I could not reproduce this in pure arrow yet. The categorical gets coerced to a dictionary with UInt32 keys and Utf8<i64> values. When written with pyarrow it's OK, but with the current parquet writer in polars we create an invalid Parquet file.

Output:

thread '' panicked at 'assertion failed: len <= output_buf.len()', /home/ritchie46/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet2-0.8.0/src/compression.rs:146:13
stack backtrace:
   0: rust_begin_unwind at /rustc/acbe4443cc4c9695c0b74a7b64b60333c990a400/library/std/src/panicking.rs:498:5
   1: core::panicking::panic_fmt at /rustc/acbe4443cc4c9695c0b74a7b64b60333c990a400/library/core/src/panicking.rs:107:14
   2: core::panicking::panic at /rustc/acbe4443cc4c9695c0b74a7b64b60333c990a400/library/core/src/panicking.rs:48:5
   3: parquet2::compression::decompress at /home/ritchie46/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet2-0.8.0/src/compression.rs:146:13
   4: parquet2::page::page_dict::read_dict_page at /home/ritchie46/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet2-0.8.0/src/page/page_dict/mod.rs:56:9
   5: parquet2::read::page_iterator::finish_page at /home/ritchie46/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet2-0.8.0/src/read/page_iterator.rs:186:24
   6: parquet2::read::page_iterator::build_page at /home/ritchie46/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet2-0.8.0/src/read/page_iterator.rs:145:18
   7: parquet2::read::page_iterator::next_page at /home/ritchie46/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet2-0.8.0/src/read/page_iterator.rs:120:20
   8: as core::iter::traits::iterator::Iterator>::next at /home/ritchie46/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet2-0.8.0/src/read/page_iterator.rs:89:32
   9: <&mut I as core::iter::traits::iterator::Iterator>::next at /rustc/acbe4443cc4c9695c0b74a7b64b60333c990a400/library/core/src/iter/traits/iterator.rs:3465:9
  10: as fallible_streaming_iterator::FallibleStreamingIterator>::advance at /home/ritchie46/.cargo/registry/src/github.com-1ecc6299db9ec823/streaming-decompression-0.1.0/src/lib.rs:85:20
  11: as fallible_streaming_iterator::FallibleStreamingIterator>::advance at /home/ritchie46/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet2-0.8.0/src/read/compression.rs:253:9
  12: fallible_streaming_iterator::FallibleStreamingIterator::next at /home/ritchie46/.cargo/registry/src/github.com-1ecc6299db9ec823/fallible-streaming-iterator-0.1.9/src/lib.rs:52:9
  13: <&mut I as fallible_streaming_iterator::FallibleStreamingIterator>::next at /home/ritchie46/.cargo/registry/src/github.com-1ecc6299db9ec823/fallible-streaming-iterator-0.1.9/src/lib.rs:330:9
  14: arrow2::io::parquet::read::binary::dictionary::iter_to_array at /home/ritchie46/.cargo/git/checkouts/arrow2-8a2ad61d97265680/4320687/src/io/parquet/read/binary/dictionary.rs:143:28
  15: arrow2::io::parquet::read::dict_read at /home/ritchie46/.cargo/git/checkouts/arrow2-8a2ad61d97265680/4320687/src/io/parquet/read/mod.rs:167:22
  16: arrow2::io::parquet::read::page_iter_to_array at /home/ritchie46/.cargo/git/checkouts/arrow2-8a2ad61d97265680/4320687/src/io/parquet/read/mod.rs:320:36
  17: arrow2::io::parquet::read::column_iter_to_array at /home/ritchie46/.cargo/git/checkouts/arrow2-8a2ad61d97265680/4320687/src/io/parquet/read/mod.rs:392:25
  18: as core::iter::traits::iterator::Iterator>::next::{{closure}} at /home/ritchie46/.cargo/git/checkouts/arrow2-8a2ad61d97265680/4320687/src/io/parquet/read/record_batch.rs:153:39
  19: as core::iter::traits::iterator::Iterator>::try_fold::enumerate::{{closure}} at /rustc/acbe4443cc4c9695c0b74a7b64b60333c990a400/library/core/src/iter/adapters/enumerate.rs:85:27
  20: core::iter::traits::iterator::Iterator::try_fold at /rustc/acbe4443cc4c9695c0b74a7b64b60333c990a400/library/core/src/iter/traits/iterator.rs:1995:21
  21: as core::iter::traits::iterator::Iterator>::try_fold at /rustc/acbe4443cc4c9695c0b74a7b64b60333c990a400/library/core/src/iter/adapters/enumerate.rs:91:9
  22: as core::iter::traits::iterator::Iterator>::next at /home/ritchie46/.cargo/git/checkouts/arrow2-8a2ad61d97265680/4320687/src/io/parquet/read/record_batch.rs:140:17
  23: polars_io::parquet::>::next_record_batch at /home/ritchie46/code/polars/polars/polars-io/src/parquet.rs:103:9
  24: polars_io::finish_reader at /home/ritchie46/code/polars/polars/polars-io/src/lib.rs:81:29
  25: as polars_io::SerReader>::finish at /home/ritchie46/code/polars/polars/polars-io/src/parquet.rs:152:9
  26: polars::dataframe::PyDataFrame::read_parquet at /home/ritchie46/code/polars/py-polars/src/dataframe.rs:194:24
  27: polars::dataframe::__init9016075107124976429::__wrap::{{closure}} at /home/ritchie46/code/polars/py-polars/src/dataframe.rs:65:1
  28: pyo3::callback::handle_panic::{{closure}} at /home/ritchie46/.cargo/git/checkouts/pyo3-d009474511846c5e/5357442/src/callback.rs:247:9
  29: std::panicking::try::do_call at /rustc/acbe4443cc4c9695c0b74a7b64b60333c990a400/library/std/src/panicking.rs:406:40
  30: __rust_try
  31: std::panicking::try at /rustc/acbe4443cc4c9695c0b74a7b64b60333c990a400/library/std/src/panicking.rs:370:19
  32: std::panic::catch_unwind at /rustc/acbe4443cc4c9695c0b74a7b64b60333c990a400/library/std/src/panic.rs:133:14
  33: pyo3::callback::handle_panic at /home/ritchie46/.cargo/git/checkouts/pyo3-d009474511846c5e/5357442/src/callback.rs:245:24
  34: polars::dataframe::__init9016075107124976429::__wrap at /home/ritchie46/code/polars/py-polars/src/dataframe.rs:65:1
  35: cfunction_call at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Objects/methodobject.c:543
  36: _PyObject_MakeTpCall at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Objects/call.c:191:18
  37: _PyObject_VectorcallTstate at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Include/cpython/abstract.h:116:16
  38: _PyObject_VectorcallTstate at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Include/cpython/abstract.h:103:1
  39: PyObject_Vectorcall at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Include/cpython/abstract.h:127
  40: call_function at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Python/ceval.c:5075
  41: _PyEval_EvalFrameDefault at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Python/ceval.c:3487
  42: _PyEval_EvalFrame at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Include/internal/pycore_ceval.h:40:12
  43: _PyEval_EvalCode at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Python/ceval.c:4327:14
  44: _PyFunction_Vectorcall at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Objects/call.c:396:12
  45: _PyObject_VectorcallTstate at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Include/cpython/abstract.h:118:11
  46: PyObject_Vectorcall at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Include/cpython/abstract.h:127
  47: call_function at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Python/ceval.c:5075
  48: _PyEval_EvalFrameDefault at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Python/ceval.c:3535
  49: _PyEval_EvalFrame at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Include/internal/pycore_ceval.h:40:12
  50: _PyEval_EvalCode at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Python/ceval.c:4327:14
  51: _PyFunction_Vectorcall at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Objects/call.c:396:12
  52: _PyObject_VectorcallTstate at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Include/cpython/abstract.h:118:11
  53: PyObject_Vectorcall at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Include/cpython/abstract.h:127
  54: call_function at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Python/ceval.c:5075
  55: _PyEval_EvalFrameDefault at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Python/ceval.c:3535
  56: _PyEval_EvalFrame at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Include/internal/pycore_ceval.h:40:12
  57: _PyEval_EvalCode at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Python/ceval.c:4327:14
  58: _PyEval_EvalCodeWithName at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Python/ceval.c:4359:12
  59: PyEval_EvalCodeEx at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Python/ceval.c:4375:12
  60: PyEval_EvalCode at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Python/ceval.c:826:12
  61: run_eval_code_obj at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Python/pythonrun.c:1219
  62: run_mod at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Python/pythonrun.c:1240
  63: pyrun_file at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Python/pythonrun.c:1138
  64: pyrun_simple_file at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Python/pythonrun.c:449
  65: PyRun_SimpleFileExFlags at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Python/pythonrun.c:482
  66: pymain_run_file at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Modules/main.c:379
  67: pymain_run_python at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Modules/main.c:604:21
  68: Py_RunMain at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Modules/main.c:683
  69: Py_BytesMain at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Modules/main.c:1129
  70: __libc_start_main
  71:
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Traceback (most recent call last):
  File "/home/ritchie46/code/polars/memcheck/run.py", line 23, in
    _ = pl.read_parquet("df.parquet", use_pyarrow=False)
  File "/home/ritchie46/code/polars/py-polars/polars/io.py", line 712, in read_parquet
    return DataFrame.read_parquet(
  File "/home/ritchie46/code/polars/py-polars/polars/internals/frame.py", line 525, in read_parquet
    self._df = PyDataFrame.read_parquet(
pyo3_runtime.PanicException: assertion failed: len <= output_buf.len()

@jorgecarleitao
Collaborator

Most likely a bug in arrow2 or parquet2 :/ Filed jorgecarleitao/arrow2#667

@jorgecarleitao
Collaborator

jorgecarleitao commented Dec 9, 2021

Closed by jorgecarleitao/parquet2#72

Whoops, wrong repo; I wanted to close jorgecarleitao/arrow2#667 :/ Sorry about the noise.

@jorgecarleitao
Collaborator

Found the root cause, patched it in main, and released a new parquet2 0.8.1 with the fix. Unfortunately the files are unrecoverable, as this was an error in writing the file according to the spec :(
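As an illustration of why such files are unrecoverable: a Parquet reader sizes its output buffer from the uncompressed size recorded in the page header, so a header written with a too-small size fails every spec-compliant reader. A minimal stdlib sketch of that invariant follows; zlib stands in for Parquet's actual codecs, and read_page is a hypothetical helper, not a parquet2 API:

```python
import zlib

def read_page(compressed: bytes, declared_size: int) -> bytes:
    # Hypothetical reader check: the decompressed payload must fit in
    # a buffer of the size declared in the page header -- the invariant
    # behind `assertion failed: len <= output_buf.len()` above.
    data = zlib.decompress(compressed)
    if len(data) > declared_size:
        raise OSError(
            f"Invalid: Output buffer size ({declared_size}) "
            f"must be {len(data)} or larger."
        )
    return data

payload = b"dictionary page with categorical values"
page = zlib.compress(payload)

# A spec-compliant writer records the true uncompressed size:
assert read_page(page, len(payload)) == payload

# A writer that records a smaller size (the writer bug fixed in
# parquet2 0.8.1) produces a file no conforming reader can accept:
try:
    read_page(page, len(payload) - 2)
except OSError as err:
    print(err)
```

Since the wrong size is baked into the already-written file, no reader-side patch can recover it, which matches the "Output buffer size (28) must be 30 or larger" error reported above.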

@ritchie46
Member

> Found the root cause, patched it in main and released a new parquet2 0.8.1 with the patch. Unfortunately the files are un-recoverable as this was an error in writing the file according to the spec :(

That was fast. Thanks a lot!

@mhconradt this will be patched with a PyPI release at the end of this week.
