
polars can't read timestamp[s] typed columns in parquet files made by pyarrow #2543

Closed
eitsupi opened this issue Feb 4, 2022 · 4 comments · Fixed by #2564
Comments

@eitsupi (Contributor) commented Feb 4, 2022

Are you using Python or Rust?

Python

What version of polars are you using?

polars-0.12.20

What operating system are you using polars on?

Linux (Debian 11)

Describe your bug.

Polars cannot correctly read datetime values from Parquet files whose columns were written with the timestamp[s] type by pyarrow.

>>> pl.read_parquet("test.parquet")
shape: (2, 2)
┌───────────────────────┬─────────────────────┐
│ datetime[s]           ┆ datetime[ms]        │
│ ---                   ┆ ---                 │
│ datetime[ms]          ┆ datetime[ms]        │
╞═══════════════════════╪═════════════════════╡
│ +52671-12-25 12:26:40 ┆ 2020-09-13 12:26:40 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ +55840-11-08 22:13:20 ┆ 2023-11-14 22:13:20 │
└───────────────────────┴─────────────────────┘

I could not determine whether this is a problem in polars or in arrow2, but since several Parquet readers other than polars handle these files correctly, I do not think the official Arrow library is at fault. I apologize if this is not the appropriate repository for this report.

What are the steps to reproduce the behavior?

import polars as pl
import pyarrow as pa
import pyarrow.parquet as pq

pt = pa.table([
    pa.array([1600000000, 1700000000], type=pa.timestamp("s")),
    pa.array([1600000000000, 1700000000000], type=pa.timestamp("ms"))
], names=["datetime[s]", "datetime[ms]"])

pq.write_table(pt, "test.parquet")

pl.read_parquet("test.parquet")
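A quick sanity check on the corrupted dates (an editor's sketch, not part of the original report): the +52671 and +55840 rows are exactly what you would see if the stored seconds were scaled by 1000 once too many and the result reinterpreted as seconds since the Unix epoch. For these two particular inputs the time of day even survives that scaling, which matches the table above:

```python
# Editor's sketch (not from the original report): the garbled dates are
# consistent with the raw seconds being multiplied by 1000 once too many
# and then read back as seconds since the Unix epoch.
def time_of_day(epoch_seconds: int) -> str:
    """Return HH:MM:SS of an epoch-seconds value (UTC)."""
    r = epoch_seconds % 86400
    return f"{r // 3600:02d}:{(r % 3600) // 60:02d}:{r % 60:02d}"

for s in (1600000000, 1700000000):
    # For these two inputs the correct value and the 1000x-scaled value
    # happen to share a time of day, as shown in the polars output above.
    print(time_of_day(s), time_of_day(s * 1000))
```

Note that matching times of day here are a coincidence of these particular inputs, not a general property of the ×1000 scaling.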

What is the expected behavior?

The timestamps should be read as the exact times, as other tools such as pyarrow do:

>>> pl.from_arrow(pq.read_table("test.parquet"))
shape: (2, 2)
┌─────────────────────┬─────────────────────┐
│ datetime[s]         ┆ datetime[ms]        │
│ ---                 ┆ ---                 │
│ datetime[ms]        ┆ datetime[ms]        │
╞═════════════════════╪═════════════════════╡
│ 2020-09-13 12:26:40 ┆ 2020-09-13 12:26:40 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2023-11-14 22:13:20 ┆ 2023-11-14 22:13:20 │
└─────────────────────┴─────────────────────┘
@ritchie46 (Member)

@jorgecarleitao is this maybe a conversion error in arrow2/parquet2?

import polars as pl
import pyarrow as pa
import pyarrow.parquet as pq

seconds = [1600000000, 1700000000]

pt = pa.table([
    pa.array(seconds, type=pa.timestamp("s")),
], names=["datetime[s]"])

pq.write_table(pt, "test.parquet")
df = pl.read_parquet("test.parquet")

# undo the conversion done by polars
seconds_read = df.to_series().cast(int) // 1000

for a, b in zip(seconds_read, seconds):
    print(a - b)

assert seconds_read.to_list() == seconds
1598400000000
1698300000000

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
/tmp/ipykernel_58298/418428865.py in <module>
     15     print(a - b)
     16 
---> 17 assert seconds_read.to_list() == seconds

AssertionError: 
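The printed differences line up with a single spurious ×1000 scaling (an editor's arithmetic check, not from the thread): if seconds_read comes back as seconds × 1000 rather than seconds, the difference printed for each row is 999 × seconds.

```python
# Editor's check (not from the thread): if the stored values carry one
# extra x1000 factor, then seconds_read == seconds * 1000 and the
# printed difference is 999 * seconds for each row.
seconds = [1600000000, 1700000000]
expected_diffs = [999 * s for s in seconds]
print(expected_diffs)  # matches the 1598400000000 / 1698300000000 output above
```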


Writing/reading with arrow2

Writing and reading timestamp with ms unit seems to go fine.

import polars as pl

seconds = [1600000000, 1700000000]
df = pl.DataFrame({
    "time": seconds
}).with_column(pl.col("time").cast(pl.Datetime))

df.to_parquet("test.parquet")
df = pl.read_parquet("test.parquet")

assert df.to_series().cast(int).to_list() == seconds

@jorgecarleitao (Collaborator)

Looking into it.

@jorgecarleitao (Collaborator)

Done in jorgecarleitao/arrow2#803 . Thanks for the ping!

@ritchie46 (Member)

Thanks for the fix!😉
