write_deltalake not respecting writer_properties #2064

Closed
nholt01 opened this issue Jan 10, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@nholt01

nholt01 commented Jan 10, 2024

Environment

Delta-rs version:

Binding: python 0.15.1

Environment:

  • Cloud provider:
  • OS: Windows 11
  • Other:

Bug

What happened:
I have a call to write_deltalake that successfully writes out a Delta table, but it doesn't seem to respect what is passed through the writer_properties parameter. In this case, I want to specify ZSTD level 3 compression, but the output does not get compressed at all (and there are no .zstd.parquet file extensions).

What you expected to happen:
The files produced should be compressed with ZSTD level 3 compression. All other parameters in the code snippet below are confirmed to be working, including partitioning.

How to reproduce it:

import random

import deltalake
import polars
from deltalake import Schema, WriterProperties

final_df = polars.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [5, 6, 7, 8],
    "c": [random.randrange(0, 3) for _ in range(4)]
})

final_df_arrow = final_df.to_arrow()
deltalake.write_deltalake(
    "output_table_from_deltalake",
    final_df_arrow,
    engine="pyarrow",
    schema=Schema.from_pyarrow(final_df_arrow.schema),
    mode="append",
    # These compression settings do not show up in the written files
    writer_properties=WriterProperties(
        compression="ZSTD",
        compression_level=3
    ),
    partition_by="c"
)

More details:

@nholt01 nholt01 added the bug Something isn't working label Jan 10, 2024
@ion-elgreco
Collaborator

ion-elgreco commented Jan 10, 2024

@nholt01 the WriterProperties are only used when you write with the rust engine. The language server should have hinted to you that WriterProperties with engine='pyarrow' is not allowed since there is not an overload for that. See here: https://github.com/delta-io/delta-rs/blob/f7c303b74218c202ef683f727701a67da5aaaca5/python/deltalake/writer.py#L108C1-L133C11

You can either write with WriterProperties using engine='rust', or pass ds.ParquetFileWriteOptions to file_options when you set engine='pyarrow' :)
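A rough sketch of those two options, for anyone landing here later. The helper name `zstd_write_kwargs` is made up for illustration and is not part of deltalake. It assumes the `engine` argument as it exists in the 0.15.x Python binding discussed in this issue, so other releases may behave differently. The pyarrow branch builds the write options through `ds.ParquetFileFormat().make_write_options(...)`.

```python
from typing import Any


def zstd_write_kwargs(engine: str, level: int = 3) -> dict[str, Any]:
    """Hypothetical helper: build the ZSTD keyword arguments for one engine.

    The returned dict is meant to be splatted into write_deltalake(...).
    """
    if engine == "rust":
        # The rust engine reads compression settings from WriterProperties.
        from deltalake import WriterProperties

        return {
            "engine": "rust",
            "writer_properties": WriterProperties(
                compression="ZSTD", compression_level=level
            ),
        }
    if engine == "pyarrow":
        # The pyarrow engine takes ParquetFileWriteOptions via file_options.
        import pyarrow.dataset as ds

        return {
            "engine": "pyarrow",
            "file_options": ds.ParquetFileFormat().make_write_options(
                compression="zstd", compression_level=level
            ),
        }
    raise ValueError(f"unknown engine: {engine!r}")
```

With the snippet from the report, that would look like `write_deltalake("output_table_from_deltalake", final_df_arrow, mode="append", partition_by="c", **zstd_write_kwargs("rust"))`. The libraries are imported inside each branch only so that the sketch needs just the one for the engine you pick.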

@nholt01
Author

nholt01 commented Jan 10, 2024

@ion-elgreco I apologize; you're absolutely correct. I switched to using the Rust engine and it all seems to be working now. Thanks for your help!

Out of curiosity: is there a "preference" of which engine to use for insertion performance?

@nholt01 nholt01 closed this as completed Jan 10, 2024
@ion-elgreco
Collaborator

@nholt01 for a normal write without partitioning, I sometimes saw 3-4x faster writes with the rust engine.

With partitioning, I saw similar speeds between the pyarrow and rust engines.

I suggest using the rust engine writer. This will be the way forward for the library, and all the newer protocol versions will only be supported with that one (for example, constraints).

The only thing that is missing is predicate overwrite in rust, but this is on its way.
