Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DeltaTable Merge throws in merging if there are uppercase in Schema. #1797

Closed
mohitbansal-gep opened this issue Nov 3, 2023 · 12 comments · Fixed by #1954
Closed

DeltaTable Merge throws in merging if there are uppercase in Schema. #1797

mohitbansal-gep opened this issue Nov 3, 2023 · 12 comments · Fixed by #1954
Labels
binding/python Issues for the Python package bug Something isn't working

Comments

@mohitbansal-gep
Copy link

mohitbansal-gep commented Nov 3, 2023

Environment

Delta-rs version: deltalake==0.12.0

Binding:

Environment:

  • Cloud provider:
  • OS:
  • Other: python

Bug

What happened:
If deltatable schema has attribute with upper case in them, then the merge throws Generic deltatable error stating field not found
What you expected to happen:
The Deltatable merge operation should have been performed.
How to reproduce it:
Sample code:-

from deltalake import DeltaTable, write_deltalake
from datetime import datetime
import polars as pl

df = pl.DataFrame(
    {
        "Sales_order_id": ["1000", "1001", "1002", "1003"],
        "product": ["bike", "scooter", "car", "motorcycle"],
        "order_date": [
            datetime(2023, 1, 1),
            datetime(2023, 1, 5),
            datetime(2023, 1, 10),
            datetime(2023, 2, 1),
        ],
        "sales_price": [120.25, 2400, 32000, 9000],
        "paid_by_customer": [True, False, False, True],
    }
)
print(df)
df.write_delta("sales_orders_old", mode="append")

new_data = pl.DataFrame(
    {
        "Sales_order_id": ["1002", "1004"],
        "product": ["car", "car"],
        "order_date": [datetime(2023, 1, 10), datetime(2023, 2, 5)],
        "sales_price": [30000.0, 40000.0],
        "paid_by_customer": [True, True],
    }
)

from polars.io.delta import _convert_pa_schema_to_delta

dt = DeltaTable("sales_orders_old")
source = new_data.to_arrow()
delta_schema = _convert_pa_schema_to_delta(source.schema)
source = source.cast(delta_schema)

(
    dt.merge(
        source=source,
        predicate="s.Sales_order_id = t.Sales_order_id",
        source_alias="s",
        target_alias="t",
    )
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute()
)

This will throw:- DeltaError: Generic DeltaTable error: Schema error: No field named s.sales_order_id.
More details:

Referenced from this blog post https://delta.io/blog/2023-10-22-delta-rs-python-v0.12.0/

@mohitbansal-gep mohitbansal-gep added the bug Something isn't working label Nov 3, 2023
@mohitbansal-gep
Copy link
Author

@rtyler i can work on this issue if required, But i need some guidance on how to load the project and initial setup.

@ion-elgreco
Copy link
Collaborator

@mohitbansal-gep if you want to contribute, you can follow the guide here: https://github.com/delta-io/delta-rs/blob/main/python/CONTRIBUTING.md

@Blajda
Copy link
Collaborator

Blajda commented Nov 4, 2023

@mohitbansal-gep I you wrap those column with double quotes it should resolve the issue.
e.g predicate="\"s.Sales_order_id\" = \"t.Sales_order_id\""

@mohitbansal-gep
Copy link
Author

mohitbansal-gep commented Nov 6, 2023

@Blajda Tried above but same issue.

DeltaError: Generic DeltaTable error: Schema error: No field named "s.Sales_order_id".

@mohitbansal-gep
Copy link
Author

@Blajda any more info?

@ArthurSteijn
Copy link

Same here. If you change the predicate of the merge, you will get the same error on a different Column.

@ion-elgreco what would be a fair direction for a fix here? Should the merge be able to handle table schemas with upper casing? (that definitely has my preference)

Schema:
Schema([Field(CustomerID, PrimitiveType("long"), nullable=True), Field(NameStyle, PrimitiveType("long"), nullable=True), Field(Title, PrimitiveType("string"), nullable=True), Field(FirstName, PrimitiveType("string"), nullable=True), Field(MiddleName, PrimitiveType("string"), nullable=True), Field(LastName, PrimitiveType("string"), nullable=True), Field(Suffix, PrimitiveType("string"), nullable=True), Field(CompanyName, PrimitiveType("string"), nullable=True), Field(SalesPerson, PrimitiveType("string"), nullable=True), Field(EmailAddress, PrimitiveType("string"), nullable=True), Field(Phone, PrimitiveType("string"), nullable=True), Field(PasswordHash, PrimitiveType("string"), nullable=True), Field(PasswordSalt, PrimitiveType("string"), nullable=True)])
Merge:

source = source_df.to_arrow()
delta_schema = _convert_pa_schema_to_delta(source.schema)
source = source.cast(delta_schema)
(
    dt.merge(
        source=source,
        predicate='s."CustomerID" = t."CustomerID"',
        source_alias="s",
        target_alias="t",
    )
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute()
)

Error:
DeltaError: Generic DeltaTable error: Schema error: No field named s.passwordsalt. Valid fields are s."CustomerID", s."NameStyle", s."Title", s."FirstName", s."MiddleName", s."LastName", s."Suffix", s."CompanyName", s."SalesPerson", s."EmailAddress", s."Phone", s."PasswordHash", s."PasswordSalt", s.__delta_rs_source, t."CustomerID", t."NameStyle", t."Title", t."FirstName", t."MiddleName", t."LastName", t."Suffix", t."CompanyName", t."SalesPerson", t."EmailAddress", t."Phone", t."PasswordHash", t."PasswordSalt", t.__delta_rs_target.

@ion-elgreco
Copy link
Collaborator

ion-elgreco commented Nov 7, 2023

@ArthurSteijn
Let me see if I can work on a fix for the update_all, insert_all.

Also fyi, _convert_pa_schema_to_delta behavour will change in next Polars version since it won't return normal arrow types now, but large. I am planning on porting the old behaviour into deltalake directly so we can use it for multiple use case (like pandas conversion and for the merge).

@Blajda
Copy link
Collaborator

Blajda commented Nov 7, 2023

@ion-elgreco I have a potential fix at the rust level for this which should not require any python related changes. Will upload the branch later today and add new test to confirm.

@ion-elgreco
Copy link
Collaborator

@ion-elgreco I have a potential fix at the rust level for this which should not require any python related changes. Will upload the branch later today and add new test to confirm.

Ah nice, even better!

@mohitbansal-gep
Copy link
Author

@Blajda any updates?

@Blajda
Copy link
Collaborator

Blajda commented Nov 11, 2023

I think a fix for this will come after the refactor to logical plans are merged. This should come soonish just waiting on a new datafusion release.

@ion-elgreco
Copy link
Collaborator

ion-elgreco commented Nov 30, 2023

@Blajda I checked against the latest main, however the issue still persist with logical plans.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/python Issues for the Python package bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants