Cannot read Parquet file with categorical type written with to_parquet #2009
I could reproduce this in the arrow reading path as well:

shape: (5, 2)
┌──────┬─────────┐
│ cats ┆ numbers │
│ --- ┆ --- │
│ cat ┆ i64 │
╞══════╪═════════╡
│ "AA" ┆ 0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ "AB" ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ "AC" ┆ 2 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ "AD" ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ "AE" ┆ 4 │
└──────┴─────────┘

@jorgecarleitao have you got an idea what this is? I could not reproduce this in pure arrow yet. The categorical gets coerced to a dictionary of keys.

Output:

thread '' panicked at 'assertion failed: len <= output_buf.len()', /home/ritchie46/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet2-0.8.0/src/compression.rs:146:13
stack backtrace:
0: rust_begin_unwind at /rustc/acbe4443cc4c9695c0b74a7b64b60333c990a400/library/std/src/panicking.rs:498:5
1: core::panicking::panic_fmt at /rustc/acbe4443cc4c9695c0b74a7b64b60333c990a400/library/core/src/panicking.rs:107:14
2: core::panicking::panic at /rustc/acbe4443cc4c9695c0b74a7b64b60333c990a400/library/core/src/panicking.rs:48:5
3: parquet2::compression::decompress at /home/ritchie46/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet2-0.8.0/src/compression.rs:146:13
4: parquet2::page::page_dict::read_dict_page at /home/ritchie46/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet2-0.8.0/src/page/page_dict/mod.rs:56:9
5: parquet2::read::page_iterator::finish_page at /home/ritchie46/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet2-0.8.0/src/read/page_iterator.rs:186:24
6: parquet2::read::page_iterator::build_page at /home/ritchie46/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet2-0.8.0/src/read/page_iterator.rs:145:18
7: parquet2::read::page_iterator::next_page at /home/ritchie46/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet2-0.8.0/src/read/page_iterator.rs:120:20
8: as core::iter::traits::iterator::Iterator>::next at /home/ritchie46/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet2-0.8.0/src/read/page_iterator.rs:89:32
9: <&mut I as core::iter::traits::iterator::Iterator>::next at /rustc/acbe4443cc4c9695c0b74a7b64b60333c990a400/library/core/src/iter/traits/iterator.rs:3465:9
10: as fallible_streaming_iterator::FallibleStreamingIterator>::advance at /home/ritchie46/.cargo/registry/src/github.com-1ecc6299db9ec823/streaming-decompression-0.1.0/src/lib.rs:85:20
11: as fallible_streaming_iterator::FallibleStreamingIterator>::advance at /home/ritchie46/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet2-0.8.0/src/read/compression.rs:253:9
12: fallible_streaming_iterator::FallibleStreamingIterator::next at /home/ritchie46/.cargo/registry/src/github.com-1ecc6299db9ec823/fallible-streaming-iterator-0.1.9/src/lib.rs:52:9
13: <&mut I as fallible_streaming_iterator::FallibleStreamingIterator>::next at /home/ritchie46/.cargo/registry/src/github.com-1ecc6299db9ec823/fallible-streaming-iterator-0.1.9/src/lib.rs:330:9
14: arrow2::io::parquet::read::binary::dictionary::iter_to_array at /home/ritchie46/.cargo/git/checkouts/arrow2-8a2ad61d97265680/4320687/src/io/parquet/read/binary/dictionary.rs:143:28
15: arrow2::io::parquet::read::dict_read at /home/ritchie46/.cargo/git/checkouts/arrow2-8a2ad61d97265680/4320687/src/io/parquet/read/mod.rs:167:22
16: arrow2::io::parquet::read::page_iter_to_array at /home/ritchie46/.cargo/git/checkouts/arrow2-8a2ad61d97265680/4320687/src/io/parquet/read/mod.rs:320:36
17: arrow2::io::parquet::read::column_iter_to_array at /home/ritchie46/.cargo/git/checkouts/arrow2-8a2ad61d97265680/4320687/src/io/parquet/read/mod.rs:392:25
18: as core::iter::traits::iterator::Iterator>::next::{{closure}} at /home/ritchie46/.cargo/git/checkouts/arrow2-8a2ad61d97265680/4320687/src/io/parquet/read/record_batch.rs:153:39
19: as core::iter::traits::iterator::Iterator>::try_fold::enumerate::{{closure}} at /rustc/acbe4443cc4c9695c0b74a7b64b60333c990a400/library/core/src/iter/adapters/enumerate.rs:85:27
20: core::iter::traits::iterator::Iterator::try_fold at /rustc/acbe4443cc4c9695c0b74a7b64b60333c990a400/library/core/src/iter/traits/iterator.rs:1995:21
21: as core::iter::traits::iterator::Iterator>::try_fold at /rustc/acbe4443cc4c9695c0b74a7b64b60333c990a400/library/core/src/iter/adapters/enumerate.rs:91:9
22: as core::iter::traits::iterator::Iterator>::next at /home/ritchie46/.cargo/git/checkouts/arrow2-8a2ad61d97265680/4320687/src/io/parquet/read/record_batch.rs:140:17
23: polars_io::parquet::>::next_record_batch at /home/ritchie46/code/polars/polars/polars-io/src/parquet.rs:103:9
24: polars_io::finish_reader at /home/ritchie46/code/polars/polars/polars-io/src/lib.rs:81:29
25: as polars_io::SerReader>::finish at /home/ritchie46/code/polars/polars/polars-io/src/parquet.rs:152:9
26: polars::dataframe::PyDataFrame::read_parquet at /home/ritchie46/code/polars/py-polars/src/dataframe.rs:194:24
27: polars::dataframe::__init9016075107124976429::__wrap::{{closure}} at /home/ritchie46/code/polars/py-polars/src/dataframe.rs:65:1
28: pyo3::callback::handle_panic::{{closure}} at /home/ritchie46/.cargo/git/checkouts/pyo3-d009474511846c5e/5357442/src/callback.rs:247:9
29: std::panicking::try::do_call at /rustc/acbe4443cc4c9695c0b74a7b64b60333c990a400/library/std/src/panicking.rs:406:40
30: __rust_try
31: std::panicking::try at /rustc/acbe4443cc4c9695c0b74a7b64b60333c990a400/library/std/src/panicking.rs:370:19
32: std::panic::catch_unwind at /rustc/acbe4443cc4c9695c0b74a7b64b60333c990a400/library/std/src/panic.rs:133:14
33: pyo3::callback::handle_panic at /home/ritchie46/.cargo/git/checkouts/pyo3-d009474511846c5e/5357442/src/callback.rs:245:24
34: polars::dataframe::__init9016075107124976429::__wrap at /home/ritchie46/code/polars/py-polars/src/dataframe.rs:65:1
35: cfunction_call at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Objects/methodobject.c:543
36: _PyObject_MakeTpCall at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Objects/call.c:191:18
37: _PyObject_VectorcallTstate at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Include/cpython/abstract.h:116:16
38: _PyObject_VectorcallTstate at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Include/cpython/abstract.h:103:1
39: PyObject_Vectorcall at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Include/cpython/abstract.h:127
40: call_function at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Python/ceval.c:5075
41: _PyEval_EvalFrameDefault at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Python/ceval.c:3487
42: _PyEval_EvalFrame at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Include/internal/pycore_ceval.h:40:12
43: _PyEval_EvalCode at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Python/ceval.c:4327:14
44: _PyFunction_Vectorcall at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Objects/call.c:396:12
45: _PyObject_VectorcallTstate at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Include/cpython/abstract.h:118:11
46: PyObject_Vectorcall at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Include/cpython/abstract.h:127
47: call_function at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Python/ceval.c:5075
48: _PyEval_EvalFrameDefault at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Python/ceval.c:3535
49: _PyEval_EvalFrame at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Include/internal/pycore_ceval.h:40:12
50: _PyEval_EvalCode at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Python/ceval.c:4327:14
51: _PyFunction_Vectorcall at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Objects/call.c:396:12
52: _PyObject_VectorcallTstate at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Include/cpython/abstract.h:118:11
53: PyObject_Vectorcall at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Include/cpython/abstract.h:127
54: call_function at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Python/ceval.c:5075
55: _PyEval_EvalFrameDefault at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Python/ceval.c:3535
56: _PyEval_EvalFrame at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Include/internal/pycore_ceval.h:40:12
57: _PyEval_EvalCode at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Python/ceval.c:4327:14
58: _PyEval_EvalCodeWithName at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Python/ceval.c:4359:12
59: PyEval_EvalCodeEx at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Python/ceval.c:4375:12
60: PyEval_EvalCode at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Python/ceval.c:826:12
61: run_eval_code_obj at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Python/pythonrun.c:1219
62: run_mod at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Python/pythonrun.c:1240
63: pyrun_file at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Python/pythonrun.c:1138
64: pyrun_simple_file at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Python/pythonrun.c:449
65: PyRun_SimpleFileExFlags at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Python/pythonrun.c:482
66: pymain_run_file at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Modules/main.c:379
67: pymain_run_python at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Modules/main.c:604:21
68: Py_RunMain at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Modules/main.c:683
69: Py_BytesMain at /home/conda/feedstock_root/build_artifacts/python-split_1631581389324/work/Modules/main.c:1129
70: __libc_start_main
71:
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Traceback (most recent call last):
  File "/home/ritchie46/code/polars/memcheck/run.py", line 23, in 
    _ = pl.read_parquet("df.parquet", use_pyarrow=False)
  File "/home/ritchie46/code/polars/py-polars/polars/io.py", line 712, in read_parquet
    return DataFrame.read_parquet(
  File "/home/ritchie46/code/polars/py-polars/polars/internals/frame.py", line 525, in read_parquet
    self._df = PyDataFrame.read_parquet(
pyo3_runtime.PanicException: assertion failed: len <= output_buf.len()
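To make "coerced to a dictionary of keys" concrete: a Categorical column is represented in Arrow as a dictionary array (integer keys pointing into a string value array), and the backtrace above fails inside read_dict_page while decompressing such a dictionary page. A small sketch (assuming the polars 0.10.x Python API, in particular with_column and to_arrow) that shows the coercion:

```python
import polars as pl

df = pl.DataFrame({"cats": ["AA", "AB", "AC", "AD", "AE"]}).with_column(
    pl.col("cats").cast(pl.Categorical)
)
# The Categorical column appears in the Arrow schema as a dictionary type:
# integer keys into a string value array. When written to Parquet, the
# string values land in a dictionary page, which is what read_dict_page
# in the backtrace above fails to decompress.
print(df.to_arrow().schema)
```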
Most likely a bug in
Closed by jorgecarleitao/parquet2#72

Whoops, wrong repo, I wanted to close jorgecarleitao/arrow2#667 :/ Sorry about the noise.
Found the root cause, patched it in main, and released a new parquet2.
That was fast. Thanks a lot!

@mhconradt this will be patched with a PyPI release at the end of this week.
Are you using Python or Rust?
Python
Which feature gates did you use?
N/A
What version of polars are you using?
0.10.26
What operating system are you using polars on?
macOS Big Sur
Describe your bug.
I was working with some trade data containing a categorical column (the market, e.g. "BTC-PERP"), wrote the output of some queries to Parquet files without setting use_pyarrow=True, and was then unable to read the files back. Setting use_pyarrow=True, casting the categorical column to Utf8, or removing the categorical data type altogether all suppress the issue. In short: DataFrames containing some categorical types cannot be read after being written to Parquet using the Rust engine (the default; it would be nice if use_pyarrow defaulted to True).

What are the steps to reproduce the behavior?
Here's a gist containing a reproduction and some things I tried
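A minimal sketch of such a reproduction, built from the toy cats/numbers frame shown in the comment above (the exact polars 0.10.x API, in particular with_column and to_parquet, is assumed here):

```python
import polars as pl

# Small frame with a categorical column, mirroring the one printed above.
df = pl.DataFrame({
    "cats": ["AA", "AB", "AC", "AD", "AE"],
    "numbers": [0, 1, 2, 3, 4],
}).with_column(pl.col("cats").cast(pl.Categorical))

# Write with the default Rust engine (use_pyarrow not set) ...
df.to_parquet("df.parquet")

# ... then reading it back with the Rust engine fails on 0.10.26.
pl.read_parquet("df.parquet", use_pyarrow=False)
```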
What is the actual behavior?
Reading the file raises the following:
OSError: Invalid: Output buffer size (28) must be 30 or larger.
What is the expected behavior?
I should be able to read the file regardless of engine / data type.
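Until the patched release mentioned above lands, the two workarounds from the bug description can be sketched like this (same toy frame; the polars 0.10.x API with use_pyarrow flags on both to_parquet and read_parquet is assumed):

```python
import polars as pl

df = pl.DataFrame({"cats": ["AA", "AB", "AC", "AD", "AE"]}).with_column(
    pl.col("cats").cast(pl.Categorical)
)

# Workaround 1: go through pyarrow for both writing and reading.
df.to_parquet("df_pa.parquet", use_pyarrow=True)
pl.read_parquet("df_pa.parquet", use_pyarrow=True)

# Workaround 2: drop the categorical dtype (cast to Utf8) before writing.
df.with_column(pl.col("cats").cast(pl.Utf8)).to_parquet("df_utf8.parquet")
pl.read_parquet("df_utf8.parquet")
```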