Pyarrow engine not supporting schema overwrite with Append mode #2654

Closed
gprashmi opened this issue Jul 8, 2024 · 6 comments

Comments

@gprashmi

gprashmi commented Jul 8, 2024

Environment

Delta-rs version: 0.17.4

Bug

We have a pandas DataFrame with 66 columns that is written to a Delta table with a predefined schema using the pyarrow engine. We now have a new DataFrame with 67 columns, but schema_mode='overwrite' is not supported with mode='append'.

Below is a basic example of updating the table schema with the pyarrow engine in append mode, followed by the error we see:

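import pandas as pd
import pyarrow as pa
from deltalake import DeltaTable, write_deltalake

# NOTE: delta_table_path and dayhour_partition_column are defined elsewhere (not shown)

# Define the new schema with additional columns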
new_columns = [pa.field('new_column', pa.float64())]

# Read the existing Delta table
delta_table = DeltaTable(delta_table_path)

# Get the schema of the existing table
existing_schema = delta_table.schema().to_pyarrow()
# Combine the existing schema fields with the new columns
updated_schema = pa.schema(list(existing_schema) + new_columns)

# Create a new DataFrame with the updated schema and add data
updated_data = {
    'column1': ['d', 'e', 'f'],
    'column2': [4, 5, 6],
    'dayhour': ['2024-07-01 03:00', '2024-07-01 04:00', '2024-07-01 05:00'],
    'new_column': [0.4, 0.5, 0.6]
}
updated_df = pd.DataFrame(updated_data)

# Convert the updated DataFrame to a PyArrow Table
updated_table = pa.Table.from_pandas(updated_df, schema=updated_schema)

# Append to the existing Delta table, requesting a schema merge
write_deltalake(
    table_or_uri=delta_table_path,
    data=updated_table.to_pandas(),
    schema=updated_schema,
    partition_by=[dayhour_partition_column],
    schema_mode='merge',
    mode="append",
    storage_options={"AWS_S3_ALLOW_UNSAFE_RENAME": "true"},
)

[Image: screenshot of the error]

It would be great if we could overwrite the existing schema of the Delta table in append mode without affecting the existing data. Can you please let me know whether the schema can be updated while appending data?

@gprashmi added the bug label on Jul 8, 2024
@rtyler
Member

rtyler commented Jul 8, 2024

👋 The schema_mode='merge' parameter is not supported with the pyarrow engine, and likely never will be. Is there a reason why engine='rust' cannot be used? That's the direction we're trying to move towards.

Additionally I wanted to mention that:

    storage_options={"AWS_S3_ALLOW_UNSAFE_RENAME": "true"},

This option introduces a risk of table corruption if two processes ever try to modify the same Delta table concurrently.
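
As a rough illustration of the engine='rust' suggestion above, the sketch below shows how an append with schema_mode='merge' might be written. It assumes a deltalake release in the 0.17.x range, where write_deltalake still accepts an engine argument. The helper names rust_append_kwargs and append_with_schema_merge are hypothetical and exist only for this example.

```python
from typing import Any, Dict, List, Optional


def rust_append_kwargs(partition_by: Optional[List[str]] = None) -> Dict[str, Any]:
    """Build keyword arguments for an append that merges new columns into the schema.

    Hypothetical helper for illustration; it is not part of deltalake.
    """
    kwargs: Dict[str, Any] = {
        "mode": "append",
        "schema_mode": "merge",  # request schema evolution on append
        "engine": "rust",        # assumption: the installed version accepts `engine`
    }
    if partition_by:
        kwargs["partition_by"] = partition_by
    return kwargs


def append_with_schema_merge(table_uri, data, partition_by=None, storage_options=None):
    """Append `data` (e.g. a pandas DataFrame or pyarrow Table) to the table at `table_uri`."""
    # Imported lazily so the helper above can be inspected without deltalake installed.
    from deltalake import write_deltalake

    write_deltalake(
        table_uri,
        data,
        storage_options=storage_options,
        **rust_append_kwargs(partition_by),
    )
```

With the variables from the report, a call might look like append_with_schema_merge(delta_table_path, updated_df, partition_by=[dayhour_partition_column]).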

@rtyler self-assigned this on Jul 8, 2024
@gprashmi
Author

gprashmi commented Jul 8, 2024

The reason for not using engine='rust' is that we want this Delta table mapped to Trino (for Grafana integration). The mapping between a Delta table written with the rust engine and Trino was not compatible and threw the error below, so we had to move to pyarrow, with which we could map to Trino.

@gprashmi
Author

gprashmi commented Jul 8, 2024

@g12-al

@gprashmi
Author

gprashmi commented Jul 8, 2024

Also, you mentioned that this storage option introduces a table corruption risk if two processes ever try to modify the same Delta table concurrently:
storage_options={"AWS_S3_ALLOW_UNSAFE_RENAME": "true"},

We initially ran optimize.compact() on the Delta table, and this caused spaces to appear randomly in the partition URL encoding (I opened a GitHub issue for this: #2634). Could the storage_options setting be related to this random URL encoding with spaces?

@gprashmi
Author

@rtyler Is there any update on the above comment about using the pyarrow engine for writing data?

@ion-elgreco
Collaborator

> The reason for not using engine='rust' is we want to have this delta table mapped to Trino (for Grafana integration). But the mapping between delta table with rust and Trino was not compatible and throwing the below error. So we had to move to Pyarrow with which we could map to Trino.

We are going to deprecate the pyarrow engine eventually, so schema evolution won't be supported there.

If you are encountering issues with Trino when reading tables created by the rust engine, please create a separate issue with an MRE (minimal reproducible example).

@ion-elgreco closed this as not planned on Jul 23, 2024