Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parquet Record API: Cannot convert date before Unix epoch to json #3430

Closed
ByteBaker opened this issue Jan 2, 2023 · 4 comments · Fixed by #3437
Closed

Parquet Record API: Cannot convert date before Unix epoch to json #3430

ByteBaker opened this issue Jan 2, 2023 · 4 comments · Fixed by #3437
Labels
bug parquet Changes to the parquet crate

Comments

@ByteBaker
Copy link
Contributor

Describe the bug
While using parquet::file::reader::SerializedFileReader, calling to_json_value on a parquet Row, containing a date before 1970/01/01 panics.

A similar issue #2897 was reported earlier and closed after #2899 solved it. But it only handled Field::TimestampMillis and Field::TimestampMicros types, Field::Date still breaks.

To Reproduce

Inside test parquet::record::api::tests::test_convert_date_to_string, add an extra call to function check_date_conversion with any date before 1970/01/01. Another way is to use a parquet file with at least one Row containing such value.

This panics with the message

thread 'record::api::tests::test_convert_date_to_string' panicked at 'No such local time'

Expected behavior

Be able to call Row::to_json_value without panic.


Additional context

Why #2899 works:
enum Field defines Date, TimestampMillis, TimestampMicros fields as:

pub enum Field {
   ...
   Date(u32),
   TimestampMillis(u64),
   TimestampMicros(u64),
   ...
}

When i32 or i64 are interpreted as their unsigned counterparts, the value underflows due to the now lack of a sign bit, e.g., -7766_i32 becomes 4294959530_u32.

When the new value is interpreted as its signed variant again, we get the original value (since the bytes never changed). Therefore assert_eq!(4294959530_u32 as i32, -7766_i32) succeeds.

#2899 utilizes this fact to solve the problem. But the same code doesn't resolve it for Field::Date in the below snippet, because it holds a u32, and promoting it an i64 changes the value itself (since it treats the MSB of the u32 for value, and not sign, which was crucial). Below code doesn't work.

fn convert_date_to_string(value: u32) -> String {
    static NUM_SECONDS_IN_DAY: i64 = 60 * 60 * 24;
    let dt = Utc
        .timestamp_opt(value as i64 * NUM_SECONDS_IN_DAY, 0)
        .unwrap();
    format!("{}", dt.format("%Y-%m-%d %:z"))
}

The correct way would be to read value to i32 before promoting it to i64. Therefore below code works.

fn convert_date_to_string(value: u32) -> String {
    static NUM_SECONDS_IN_DAY: i64 = 60 * 60 * 24;
    let dt = Utc
        .timestamp_opt(value as i32 as i64 * NUM_SECONDS_IN_DAY, 0)
        .unwrap();
    format!("{}", dt.format("%Y-%m-%d %:z"))
}

Though my personal opinion is to change the type definition of Field and use the signed integer types for Date, TimestampMillis and TimestampMicros, because we can always have dates/times before Unix epoch. Thus, negative timestamps are required.

@ByteBaker ByteBaker added the bug label Jan 2, 2023
@tustvold
Copy link
Contributor

tustvold commented Jan 3, 2023

I would be happy with either approach, changing the definition of field to be signed to match thr parquet datatypes makes a lot of sense to me

@tustvold tustvold changed the title Parquet API: Cannot convert date before Unix epoch to json Parquet Record API: Cannot convert date before Unix epoch to json Jan 3, 2023
@ByteBaker
Copy link
Contributor Author

Great. I thought the same. Will write the code and commit in a few hours.

BTW another question that I forgot to ask above. Does it make sense to have a timezone while displaying just the date? The current implementation does it through format!("{}", dt.format("%Y-%m-%d %:z")), but since the date is always initialized from Utc, the timezone will always be +00:00.

@tustvold
Copy link
Contributor

tustvold commented Jan 3, 2023

Does it make sense to have a timezone while displaying just the date

I don't think so

@ByteBaker
Copy link
Contributor Author

Okay then I'll just remove it from the string representation as well within this commit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants