Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: add a reproduction case for merge failures with struct<string> #2644

Merged
merged 3 commits into from
Jul 3, 2024

Conversation

rtyler
Copy link
Member

@rtyler rtyler commented Jul 3, 2024

E       _internal.DeltaError: Generic DeltaTable error: type_coercion
E       caused by
E       Error during planning: Failed to coerce then ([Struct([Field { name: "city", data_type: LargeUtf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "state", data_type: LargeUtf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "street", data_type: LargeUtf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), Struct([Field { name: "city", data_type: LargeUtf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "state", data_type: LargeUtf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "street", data_type: LargeUtf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), Struct([Field { name: "city", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "state", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "street", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), Struct([Field { name: "city", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "state", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "street", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), Struct([Field { name: "city", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "state", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "street", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }])]) and else (None) to common types in CASE WHEN expression

@github-actions github-actions bot added the binding/python Issues for the Python package label Jul 3, 2024
@rtyler
Copy link
Member Author

rtyler commented Jul 3, 2024

@ion-elgreco this test can pass by passing a non-default large_dtypes=False to the merge() function. In resolving the original issue (#1998) with #2003 you alluded to better type coercion support in Datafusion.

This test is a simple reproduction case of the paper cut hit by a deltalake user. It was not clear to the user why deltalake enlarges the type of a pandas.DataFrame in a way that breaks a simple merge in the default case. I'm inclined to agree, but I'm not sure what the solution should be here.

Changing the default value of ``large_dtypes=False` results in this, and all the other tests passing 😆 I believe that the upcast/enlarging should not be the default behavior, since it clearly breaks such a simple example, but I don't have a good enough understanding of why the upcast was done by default in the first place 😕

@ion-elgreco
Copy link
Collaborator

@rtyler I don't recall why this was there. But we should ideally drop all this large dtypes logic and improve the schema checking so that it's done on delta schema level and not arrow.

Also the cast kernels need to be improved so that you don't cast large utf8 to utf8 or utf8-view to utf8...

If we do it like that then all issues should be resolved I believe

rtyler added 2 commits July 3, 2024 13:44
Automatic upcasting of the data type is surprising and unexpected for
users. IMHO we should not be casting data types unless required
@rtyler rtyler force-pushed the struct-coercion branch from af72b43 to cd0ae1a Compare July 3, 2024 13:45
@rtyler
Copy link
Member Author

rtyler commented Jul 3, 2024

@ion-elgreco what are your thoughts on changing the default behavior of the function? In this reproduction case the source table is String, the source dataframe is Utf8/String and it is only the default behavior of the deltalake package that is turning anything into Largeutf8 🤦

@ion-elgreco
Copy link
Collaborator

@rtyler not sure why I left on true, probably by accident. I am fine with changing it to False, but we are just hiding the problem now : P

Perhaps an idea would be to have SOURCE record batch arrow schema used to read the target files with. Let's say we have this table_schema {"A": "string", "B": "string"}, in our code it changes the delta schema to arrow schema with non large types always, so {"A": 'utf8', "B":"utf8"}, however the incoming data might be in {"A": 'large-utf8', "B":"utf8view"} or any other combination.

So perhaps the solution to all this, would be that delta scans are done with the record batch arrow schema of the incoming data or are casted to that schema, so that source and target is in the same arrow schema. @rtyler what do you think?

@rtyler rtyler marked this pull request as ready for review July 3, 2024 16:18
@rtyler rtyler enabled auto-merge (rebase) July 3, 2024 16:18
@rtyler rtyler merged commit 9370151 into delta-io:main Jul 3, 2024
21 checks passed
@rtyler rtyler deleted the struct-coercion branch July 3, 2024 18:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/python Issues for the Python package
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants