When trying to read or scan a parquet file with 0 rows (only metadata) that has a column of (logical) type Null, a PanicException is thrown. Such a file can be created, e.g., by saving an empty pandas DataFrame that contains at least one string (or other object) column (tested using pyarrow).
thread '<unnamed>' panicked at 'attempt to divide by zero', /github/home/.cargo/git/checkouts/arrow2-945af624853845da/f7c3daf/src/io/parquet/read/deserialize/null.rs:21:27
---------------------------------------------------------------------------
PanicException Traceback (most recent call last)
/tmp/ipykernel_3424347/749936191.py in <module>
3 df = pd.DataFrame({"a": []}, dtype="str")
4 df.to_parquet(filepath)
----> 5 pl.read_parquet(filepath)
~/projects/jupyter/venv/lib/python3.8/site-packages/polars/io.py in read_parquet(source, columns, n_rows, use_pyarrow, memory_map, storage_options, parallel, row_count_name, row_count_offset, **kwargs)
919 )
920
--> 921 return DataFrame._read_parquet(
922 source_prep,
923 columns=columns,
~/projects/jupyter/venv/lib/python3.8/site-packages/polars/internals/frame.py in _read_parquet(cls, file, columns, n_rows, parallel, row_count_name, row_count_offset)
661 projection, columns = handle_projection_columns(columns)
662 self = cls.__new__(cls)
--> 663 self._df = PyDataFrame.read_parquet(
664 file,
665 columns,
PanicException: attempt to divide by zero
What is the expected behavior?
Read an empty DataFrame. Pandas can read this empty parquet file just fine.
What language are you using?
Python
Have you tried latest version of polars?
What version of polars are you using?
0.13.40
What operating system are you using polars on?
Windows 10
What language version are you using
Python 3.8.10
Additional Information
This is how empty pandas DataFrames with object columns are saved to parquet by default. The object columns are saved with INT32 physical type and Null logical type in the parquet schema. The arrow schema has them as NULL fields as well, and only the additional pandas metadata has the "correct" object numpy type.
So, I know this seems a bit obscure, but I had this happen in an ETL pipeline where sometimes a batch could be empty. To still have a file for that batch, I created an empty DataFrame with the same columns as the expected output and saved that as a parquet file. However, I forgot to also specify the parquet schema while writing, so the (pandas) string columns got turned into Null columns (since pandas string columns are actually just object columns, I suppose). The next job in the pipeline was using polars and then crashed on read.