File options are ignored when writing delta #1444

Closed
Dammi87 opened this issue Jun 6, 2023 · 5 comments
Labels
bug Something isn't working

Comments


Dammi87 commented Jun 6, 2023

Environment

Windows 10
Python 3.10.11

Delta-rs version:
deltalake 0.9.0
pyarrow 12.0.0
numpy 1.24.3


Bug

What happened:
I'm receiving JSON data from a service that uses nanosecond-resolution timestamps, and I need to store it in Delta format. Truncated timestamps are acceptable, so I intended to simply allow truncation and coerce the timestamps to microsecond resolution. However, I end up with this error:

PyDeltaTableError: Schema error: Invalid data type for Delta Lake: Timestamp(Nanosecond, Some("UTC"))

What you expected to happen:
I expected the timestamp to be truncated and converted to microseconds.

How to reproduce it:

import io
import json

import pyarrow as pa
import pyarrow.json as pj
from deltalake.writer import write_deltalake
from pyarrow.dataset import ParquetFileFormat

def get_obj(content):
    # Serialize the records as newline-delimited JSON in an in-memory buffer
    output = '\n'.join(json.dumps(d) for d in content)
    return io.BytesIO(output.encode())

def arrow(schema, content):
    # Parse the JSON lines into an Arrow table using the explicit schema
    return pj.read_json(
        get_obj(content),
        parse_options=pj.ParseOptions(
            explicit_schema=schema
        )
    )

content = [
    {'timeStamp': "2022-12-28T00:00:00.3352264Z"},
    {'timeStamp': "2022-12-28T00:00:00.3352264Z"}
]
schema = pa.schema([pa.field('timeStamp', pa.timestamp('ns', tz='UTC'))])
table = arrow(schema, content)

# Parquet write options that were expected to truncate ns timestamps to us
write_options = ParquetFileFormat().make_write_options(
    use_deprecated_int96_timestamps=False,
    coerce_timestamps='us',
    allow_truncated_timestamps=True,
)
# Fails with the schema error above: the nanosecond type is rejected
write_deltalake('test', table, file_options=write_options)

More details:
This is a minimal reproducible example from the pipeline I'm creating, which receives a stream of JSON arrays.

Dammi87 added the bug label Jun 6, 2023
@wjones127
Collaborator

Right now we expect users to cast their data types to ones Delta Lake supports. We may support automatic casting in the future. That's tracked by #686.

@Dammi87
Author

Dammi87 commented Jun 6, 2023

Gotcha thanks!

I was aware of the limitation but the only unsupported data-type I was encountering was this damn timestamp, so I hoped that the file_options would save me the work :)

Should I close the issue then?

@wjones127
Collaborator

Yeah sorry those truncation options don't work for that. I think we'd like to fold this into the general issue for mapping data types though, rather than treat timestamps specially.

@Dammi87
Author

Dammi87 commented Jun 6, 2023

No worries, you guys are doing awesome work, much appreciated

Dammi87 closed this as completed Jun 6, 2023
@neo4py

neo4py commented Mar 12, 2024

> Gotcha thanks!
>
> I was aware of the limitation but the only unsupported data-type I was encountering was this damn timestamp, so I hoped that the file_options would save me the work :)
>
> Should I close the issue then?

I am getting the same error, but I did not follow what the fix is. Can you please clarify? Thanks!
original_value = "2024-03-11T14:31:32.804589Z"
I converted it with datetime.fromisoformat(original_value).
I am using this as a column in a pandas DataFrame, and when I print the dtype it shows datetime64[ns, UTC].
I am also building the pyarrow schema from this pandas DataFrame and passing it to the write_deltalake function. When I print the data type from pyarrow, it shows timestamp[ns, tz=UTC].
I have tried truncating the seconds altogether before creating the pandas DataFrame, but to no avail.
