
Writing Tables with Append mode errors if the schema metadata is different #2419

Closed
evancurtin opened this issue Apr 15, 2024 · 3 comments
Labels
bug Something isn't working

Comments


evancurtin commented Apr 15, 2024

Environment

Delta-rs version: 0.16.4

Binding: python

Environment:

  • Cloud provider: azure
  • OS: linux64

Bug

What happened: When writing to a delta table, the schema metadata is checked when deciding whether the table can be written. I am using Unity Catalog from Databricks, which stores field-level comments in the field metadata. When I append new data to an existing table that I wrote comments for, the write fails because my local data's schema does not carry the comment information in its field metadata.

What you expected to happen: The data is appended as long as the field names and datatypes match. If schema_mode is not overwrite, the existing metadata is left unchanged and the new data is added to the table.

How to reproduce it:

  • Write a delta table whose schema carries field metadata (e.g. column comments)
  • Try to append new data with mode="append" whose schema is compatible except for that metadata
  • You will see ValueError: Schema of data does not match table schema

More details:
I believe the check occurs here: if there is any difference at all between the two schemas, an error is raised. I would think that if the field names and types match, the data could be written.

        if table:  # already exists
            if sort_arrow_schema(schema) != sort_arrow_schema(
                table.schema().to_pyarrow(as_large_types=large_dtypes)
            ) and not (mode == "overwrite" and schema_mode == "overwrite"):
                raise ValueError(
                    "Schema of data does not match table schema\n"
                    f"Data schema:\n{schema}\nTable Schema:\n{table.schema().to_pyarrow(as_large_types=large_dtypes)}"
                )
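To illustrate why this check trips on metadata-only differences, here is a minimal stdlib sketch (the Field class and helper names are mine, not delta-rs or pyarrow types): strict schema equality treats a field with a comment in its metadata as different from the same field without one, while comparing only names and types does not.

```python
from dataclasses import dataclass

# Stand-in for an Arrow field: a name, a type, and metadata
# (where Unity Catalog stores column comments).
@dataclass(frozen=True)
class Field:
    name: str
    dtype: str
    metadata: tuple = ()  # hashable stand-in for a metadata dict

def strict_equal(a, b):
    # Mirrors the failing check: any difference, including metadata, is a mismatch.
    return sorted(a, key=lambda f: f.name) == sorted(b, key=lambda f: f.name)

def names_and_types_equal(a, b):
    # The relaxation suggested above: compare only (name, dtype) pairs.
    return {(f.name, f.dtype) for f in a} == {(f.name, f.dtype) for f in b}

table_schema = [Field("id", "int64", metadata=(("comment", "primary key"),))]
data_schema = [Field("id", "int64")]  # same field, no comment metadata

print(strict_equal(table_schema, data_schema))           # False: metadata differs
print(names_and_types_equal(table_schema, data_schema))  # True
```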

@evancurtin evancurtin added the bug Something isn't working label Apr 15, 2024
@ion-elgreco (Collaborator)

@evancurtin please share a reproducible example

@evancurtin (Author)

Sorry, it seems I was mistaken. When trying to reproduce this I found that the source of the error was not this function. Would you be open to a PR that improves the schema validation error message? It would make this scenario easier to debug.

        if table:  # already exists
            if sort_arrow_schema(schema) != sort_arrow_schema(
                table.schema().to_pyarrow(as_large_types=large_dtypes)
            ) and not (mode == "overwrite" and schema_mode == "overwrite"):
                table_schema = table.schema().to_pyarrow(as_large_types=large_dtypes)
                table_fields = set(zip(table_schema.names, table_schema.types))
                data_fields = set(zip(schema.names, schema.types))
                missing = table_fields - data_fields
                extra = data_fields - table_fields
                raise ValueError(
                    "Schema of data does not match table schema\n"
                    f"Missing: {missing}\n Extra: {extra}"
                )
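The set-difference diagnostic in the snippet above can be sketched standalone (plain (name, type) tuples in place of pyarrow schemas; the function name and sample fields are illustrative):

```python
def schema_diff(table_fields, data_fields):
    """Return (missing, extra) given iterables of (name, type) pairs.

    'missing' = fields the table has that the data lacks;
    'extra'   = fields the data has that the table lacks.
    """
    table_set, data_set = set(table_fields), set(data_fields)
    return table_set - data_set, data_set - table_set

table = [("id", "int64"), ("ts", "timestamp[us]")]
data = [("id", "int64"), ("ts", "timestamp[ns]")]  # mismatched timestamp unit

missing, extra = schema_diff(table, data)
print(missing)  # {('ts', 'timestamp[us]')}
print(extra)    # {('ts', 'timestamp[ns]')}
```

Pointing at the exact offending fields like this is much easier to act on than dumping both full schemas.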

@ion-elgreco (Collaborator)

@evancurtin feel free to open a PR :)

@ion-elgreco closed this as not planned (won't fix, can't repro, duplicate, stale) on Apr 28, 2024