Parquet TZ aware datetimes not correctly read by the default engine #2723

FlorianGD · 2022-02-21T15:20:02Z

What language are you using?

Python

What version of polars are you using?

0.13.4

What operating system are you using polars on?

Ubuntu 20.04

What language version are you using?

python 3.8.12

Describe your bug.

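Parquet files containing timezone-aware datetimes are read incorrectly by the default engine: the values come back as timestamps in January 1970. Reading with pyarrow returns the correct values, but it is slower.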

What are the steps to reproduce the behavior?

import pandas as pd
import polars as pl
# Build a frame with hourly, UTC-aware timestamps
df = pd.DataFrame(data={"Timestamp": pd.date_range("2022-01-01T00:00+00:00", "2022-01-01T10:00+00:00", freq="H")})
#                    Timestamp
# 0  2022-01-01 00:00:00+00:00
# 1  2022-01-01 01:00:00+00:00
# 2  2022-01-01 02:00:00+00:00
# 3  2022-01-01 03:00:00+00:00
# 4  2022-01-01 04:00:00+00:00
# 5  2022-01-01 05:00:00+00:00
# 6  2022-01-01 06:00:00+00:00
# 7  2022-01-01 07:00:00+00:00
# 8  2022-01-01 08:00:00+00:00
# 9  2022-01-01 09:00:00+00:00
# 10 2022-01-01 10:00:00+00:00
file = "/tmp/test.parquet"
df.to_parquet(file)
pd.read_parquet(file)  # File is read fine with timezone info kept
#                    Timestamp
# 0  2022-01-01 00:00:00+00:00
# 1  2022-01-01 01:00:00+00:00
# 2  2022-01-01 02:00:00+00:00
# 3  2022-01-01 03:00:00+00:00
# 4  2022-01-01 04:00:00+00:00
# 5  2022-01-01 05:00:00+00:00
# 6  2022-01-01 06:00:00+00:00
# 7  2022-01-01 07:00:00+00:00
# 8  2022-01-01 08:00:00+00:00
# 9  2022-01-01 09:00:00+00:00
# 10 2022-01-01 10:00:00+00:00
pl.read_parquet(file)  # Timezone is lost, and the values themselves are wrong (1970 instead of 2022)
# Conversion of timezone aware to naive datetimes. TZ information may be lost.
# shape: (11, 1)
# ┌─────────────────────────┐
# │ Timestamp               │
# │ ---                     │
# │ datetime[ns]            │
# ╞═════════════════════════╡
# │ 1970-01-19 23:49:55.200 │
# ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ 1970-01-19 23:49:58.800 │
# ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ 1970-01-19 23:50:02.400 │
# ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ 1970-01-19 23:50:06     │
# ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ ...                     │
# ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ 1970-01-19 23:50:20.400 │
# ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ 1970-01-19 23:50:24     │
# ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ 1970-01-19 23:50:27.600 │
# ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ 1970-01-19 23:50:31.200 │
# └─────────────────────────┘
pl.read_parquet(file, use_pyarrow=True)  # Now the offset is correct, but TZ info is lost
# Conversion of timezone aware to naive datetimes. TZ information may be lost.
# shape: (11, 1)
# ┌─────────────────────┐
# │ Timestamp           │
# │ ---                 │
# │ datetime[μs]        │
# ╞═════════════════════╡
# │ 2022-01-01 00:00:00 │
# ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ 2022-01-01 01:00:00 │
# ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ 2022-01-01 02:00:00 │
# ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ 2022-01-01 03:00:00 │
# ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ ...                 │
# ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ 2022-01-01 07:00:00 │
# ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ 2022-01-01 08:00:00 │
# ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ 2022-01-01 09:00:00 │
# ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ 2022-01-01 10:00:00 │
# └─────────────────────┘

Note that this happens only when the datetime has a timezone; if it is naive, the datetime is read correctly.
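One possible reading of the wrong values, offered as an inference from the outputs above rather than a confirmed diagnosis: the pyarrow result has dtype datetime[μs] while the default engine reports datetime[ns], and the 1970 timestamps are what you get if a microsecond count is interpreted as nanoseconds. The sketch below checks that arithmetic with the standard library only; the helper name `misread_us_as_ns` is made up for illustration and is not part of polars or arrow2.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def misread_us_as_ns(dt):
    """Hypothetical helper: encode dt as microseconds since the epoch,
    then decode that same integer as if it were nanoseconds."""
    stored_us = (dt - EPOCH) // timedelta(microseconds=1)
    # A nanosecond count is 1000x smaller in microseconds.
    return EPOCH + timedelta(microseconds=stored_us // 1000)


first = datetime(2022, 1, 1, 0, 0, tzinfo=timezone.utc)
second = datetime(2022, 1, 1, 1, 0, tzinfo=timezone.utc)

print(misread_us_as_ns(first))   # 1970-01-19 23:49:55.200000+00:00
print(misread_us_as_ns(second))  # 1970-01-19 23:49:58.800000+00:00
```

Both results match the first two rows returned by the default engine, which is consistent with a unit mix-up and not with a timezone shift.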

What is the expected behavior?

Since the default engine is faster, I would have liked it to return at least the correct datetime values, even if the timezone itself is dropped.
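To make "correct values without the timezone" concrete, here is a minimal sketch in plain Python, not polars code: shift an aware datetime to UTC first, then drop the tzinfo, so the instant is preserved. The function name `to_naive_utc` is an illustrative assumption and says nothing about how polars or pyarrow implement the conversion.

```python
from datetime import datetime, timedelta, timezone


def to_naive_utc(dt):
    """Return a naive datetime that represents the same instant in UTC."""
    if dt.tzinfo is None:
        # Already naive: nothing to convert.
        return dt
    return dt.astimezone(timezone.utc).replace(tzinfo=None)


# 01:00 at UTC+01:00 is the same instant as 00:00 UTC.
aware = datetime(2022, 1, 1, 1, 0, tzinfo=timezone(timedelta(hours=1)))
print(to_naive_utc(aware))  # 2022-01-01 00:00:00
```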

Also, I could not find an argument to pass so that the TZ is retained. Maybe I missed something in the documentation.

Thank you for this project, I like it a lot, and I'll try to onboard my colleagues on it :)

ritchie46 · 2022-02-21T15:24:23Z

Polars does not yet support timezones.

FlorianGD · 2022-02-21T16:19:45Z

Oh, OK, I did not know. But even without the TZ, would it be possible to have the correct offset for the dates?

ritchie46 · 2022-02-22T08:35:11Z

> But even without the TZ, would it be possible to have the correct offset for the dates?

Yep, will look at that.

ritchie46 · 2022-02-23T08:22:45Z

jorgecarleitao/arrow2#861

ritchie46 mentioned this issue Feb 23, 2022

fix_rename #2740

Merged

ritchie46 mentioned this issue Feb 24, 2022

update arrow #2762

Merged

ritchie46 closed this as completed in #2762 Feb 24, 2022

wangkev mentioned this issue May 7, 2022

Timezone Support #3326

Closed
