MERGE works incorrectly with partitioned table if the data column order is not same as table column order #1787

Closed
ion-elgreco opened this issue Oct 30, 2023 · 2 comments · Fixed by #1789
Labels: bug (Something isn't working)

Comments

@ion-elgreco
Collaborator

ion-elgreco commented Oct 30, 2023

Environment

Delta-rs version: 0.12.0

Binding: Python


Bug

What happened: When you merge into a table whose partition columns are in a different order than the columns of the source data, the wrong values end up in the partition columns.

What you expected to happen:
Update the columns correctly irrespective of the order.

How to reproduce it:
Write the table first, partitioned by `bar` and then `foo`.

import polars as pl
from polars.io.delta import _convert_pa_schema_to_delta
from deltalake import DeltaTable, write_deltalake

df = pl.DataFrame({
    "foo": ['foo_value'],
    "bar": ['bar_value'],
    'value': [10]
})

arrow_data = df.to_arrow()
arrow_data = arrow_data.cast(_convert_pa_schema_to_delta(arrow_data.schema))
write_deltalake("test_table", arrow_data, partition_by=['bar', 'foo'])

Then merge new data into the table. The partition column order is `bar`, `foo`, while the data column order is `foo`, `bar`.

df_merge = pl.DataFrame({
    "foo": ['foo_value'],
    "bar": ['bar_value'],
    'value': [20]
})

arrow_data_merge = df_merge.to_arrow()
arrow_data_merge = arrow_data_merge.cast(_convert_pa_schema_to_delta(arrow_data_merge.schema))

dt = DeltaTable('test_table')

dt.merge(arrow_data_merge, 
         predicate='s.value = t.value', 
         source_alias='s', target_alias='t').when_not_matched_insert_all().execute()

The result is incorrect as you can see:

dt = DeltaTable('test_table')

print(pl.from_arrow(dt.to_pyarrow_table()))

┌───────────┬───────────┬───────┐
│ foo       ┆ bar       ┆ value │
│ ---       ┆ ---       ┆ ---   │
│ str       ┆ str       ┆ i64   │
╞═══════════╪═══════════╪═══════╡
│ bar_value ┆ foo_value ┆ 10    │
│ foo_value ┆ bar_value ┆ 20    │
└───────────┴───────────┴───────┘
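
To make the expected result explicit, here is a minimal sketch of a check in plain Python. It works on a list of row dictionaries, which is the shape `pyarrow.Table.to_pylist()` returns. The helper name `swapped_rows` is hypothetical, and the check relies on this reproduction's naming, where every partition value is `<column>_value`.

```python
# Rows copied from the printed result above, one dict per row
# (the same shape pyarrow's Table.to_pylist() produces).
observed_rows = [
    {"foo": "bar_value", "bar": "foo_value", "value": 10},
    {"foo": "foo_value", "bar": "bar_value", "value": 20},
]


def swapped_rows(rows):
    """Return the rows whose partition values sit in the wrong column.

    Hypothetical helper: it only works for this reproduction, where the
    correct value of column "foo" is "foo_value" and of "bar" is "bar_value".
    """
    return [
        row
        for row in rows
        if row["foo"] != "foo_value" or row["bar"] != "bar_value"
    ]


bad = swapped_rows(observed_rows)
print(bad)  # an empty list would mean the partition columns are correct
```

Against the real table, the same check could presumably be run as `swapped_rows(dt.to_pyarrow_table().to_pylist())`. With the rows shown above, only the row that was already in the table (`value` 10) is flagged.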

ion-elgreco added the bug label Oct 30, 2023
ion-elgreco changed the title from "Merge works incorrectly with partitioning data if the column order is not same as the partition column order" to "Merge works incorrectly with partitioned table if the data column order is not same as table column order" Oct 30, 2023
ion-elgreco changed the title from "Merge works incorrectly with partitioned table if the data column order is not same as table column order" to "MERGE works incorrectly with partitioned table if the data column order is not same as table column order" Oct 30, 2023

@MrPowers
Contributor

@Blajda - would you like me to assign this to you?

@Blajda
Collaborator

Blajda commented Oct 31, 2023

@MrPowers Sure, you can assign me. It would be nice to have permission to self-assign, since I'll handle most issues related to merge.

wjones127 pushed a commit that referenced this issue Nov 4, 2023
# Description
Sometimes the order of partition columns in our Delta schema does not
match the order of partition columns in the Delta table metadata.
This would cause `DeltaScan` to provide incorrect values for partition
columns.
This is fixed by having `DeltaScan` use the metadata as the source of
truth.

# Related Issue(s)
- closes #1787
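
The failure mode described in the commit message can be illustrated with a small sketch in plain Python. This is not delta-rs code: the function `label_partition_values` and the list layouts are hypothetical, and it assumes that a file's partition values arrive in metadata order (here `bar`, `foo`) while the buggy path labels them by schema order (here `foo`, `bar`).

```python
# Hypothetical illustration; none of these names come from delta-rs.

# Partition column order as it appears in the table schema (data order).
schema_partition_order = ["foo", "bar"]
# Partition column order recorded in the table metadata (partition_by order).
metadata_partition_order = ["bar", "foo"]

# Partition values of one file, assumed to be listed in metadata order.
file_partition_values = ["bar_value", "foo_value"]


def label_partition_values(column_order, values):
    """Pair each partition value with a column name purely by position."""
    return dict(zip(column_order, values))


# Buggy: values are in metadata order but are labeled using schema order.
buggy = label_partition_values(schema_partition_order, file_partition_values)

# Fixed: the metadata order is used as the single source of truth.
fixed = label_partition_values(metadata_partition_order, file_partition_values)

print(buggy)
print(fixed)
```

Under these assumptions, the buggy labeling puts `bar_value` in `foo` and `foo_value` in `bar`, which matches the swapped first row in the report. Labeling by metadata order assigns each value to its own column.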
3 participants