Panic when reading an empty parquet file with logical type `Null` column #3565

pspeter · 2022-06-03T13:42:51Z

What language are you using?

Python

Have you tried the latest version of polars?

[yes]

What version of polars are you using?

0.13.40

What operating system are you using polars on?

Windows 10

What language version are you using?

Python 3.8.10

Describe your bug.

Reading or scanning a parquet file that has 0 rows (only metadata) and a column of (logical) type Null raises a PanicException. Such a file can be created, for example, by saving an empty pandas DataFrame that contains at least one string (or other object) column (tested using pyarrow).

What are the steps to reproduce the behavior?

import polars as pl
import pandas as pd

filepath = "/tmp/empty.parquet"
df = pd.DataFrame({"a": []}, dtype="str")
df.to_parquet(filepath)
pl.read_parquet(filepath)

What is the actual behavior?

thread '<unnamed>' panicked at 'attempt to divide by zero', /github/home/.cargo/git/checkouts/arrow2-945af624853845da/f7c3daf/src/io/parquet/read/deserialize/null.rs:21:27

---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
/tmp/ipykernel_3424347/749936191.py in <module>
      3 df = pd.DataFrame({"a": []}, dtype="str")
      4 df.to_parquet(filepath)
----> 5 pl.read_parquet(filepath)

~/projects/jupyter/venv/lib/python3.8/site-packages/polars/io.py in read_parquet(source, columns, n_rows, use_pyarrow, memory_map, storage_options, parallel, row_count_name, row_count_offset, **kwargs)
    919             )
    920 
--> 921         return DataFrame._read_parquet(
    922             source_prep,
    923             columns=columns,

~/projects/jupyter/venv/lib/python3.8/site-packages/polars/internals/frame.py in _read_parquet(cls, file, columns, n_rows, parallel, row_count_name, row_count_offset)
    661         projection, columns = handle_projection_columns(columns)
    662         self = cls.__new__(cls)
--> 663         self._df = PyDataFrame.read_parquet(
    664             file,
    665             columns,

PanicException: attempt to divide by zero

What is the expected behavior?

Read an empty DataFrame. Pandas can read this empty parquet file just fine.

Additional Information

This is how empty pandas DataFrames with object columns are saved to parquet by default. The object columns are stored with INT32 physical type and Null logical type in the parquet schema:

>>> import pyarrow.parquet as pq
>>> pf = pq.ParquetFile(filepath)
>>> for col in pf.schema:
...     print(col)
<ParquetColumnSchema>
  name: a
  path: a
  max_definition_level: 1
  max_repetition_level: 0
  physical_type: INT32
  logical_type: Null
  converted_type (legacy): NONE

The Arrow schema also has them as null fields:

>>> for col in pf.schema_arrow:
...     print(col)
pyarrow.Field<a: null>

Only the additional pandas metadata records the "correct" numpy type, object:

>>> print(pf.schema_arrow.pandas_metadata)
{'index_columns': [{'kind': 'range',
   'name': None,
   'start': 0,
   'stop': 0,
   'step': 1}],
 'column_indexes': [{'name': None,
   'field_name': None,
   'pandas_type': 'unicode',
   'numpy_type': 'object',
   'metadata': {'encoding': 'UTF-8'}}],
 'columns': [{'name': 'a',
   'field_name': 'a',
   'pandas_type': 'empty',
   'numpy_type': 'object',
   'metadata': None}],
 'creator': {'library': 'pyarrow', 'version': '4.0.1'},
 'pandas_version': '1.4.1'}

I know this seems a bit obscure, but it happened to me in an ETL pipeline where a batch could sometimes be empty. To still have a file for such a batch, I created an empty DataFrame with the same columns as the expected output and saved it as a parquet file. However, I forgot to specify the parquet schema when writing, so the (pandas) string columns were turned into Null columns (since pandas string columns are actually just object columns, I suppose). The next job in the pipeline used polars and crashed on read.


pspeter added the bug label Jun 3, 2022

This was referenced Jun 9, 2022

Reading empty parquet leads to divide by zero. jorgecarleitao/arrow2#1060

Closed

update arrow #3650

Merged

ritchie46 closed this as completed in #3650 Jun 10, 2022
