Environment
Delta-rs version: 0.15.3
Environment: GKE
Bug
What happened:
I don't have a good MRE, but errors are raised if multiple partitions are merged in parallel. This does not happen with "overwrite" or "append" table writes, only when we .execute() a merge. For example, if I pass the unique partitions that were updated (based on the transaction log) to dynamic Airflow tasks, with each task processing a single partition, then only one of the tasks succeeds. The other tasks fail with:

File "/home/airflow/.local/lib/python3.11/site-packages/deltalake/table.py", line 1597, in execute
    metrics = self.table._table.merge_execute(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_internal.DeltaError: Generic DeltaTable error: Schema error: No field named s.unique_row_hash. Valid fields are

An example of such an Airflow task is sketched below; each mapped task attempts a merge for a distinct partition, so the tasks should not conflict with one another. Right now I have to limit my schedule so that only one partition updates at a time, which slows the pipeline down enormously compared to using "overwrite" instead of merges.
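A minimal sketch of such a mapped task, assuming a date-partitioned table with a unique_row_hash key column (both taken from the error and the description above). The table URI, the placeholder source rows, the column named date, and the choice to express the partition restriction inside the merge predicate are illustrative assumptions, not the actual pipeline code:

```python
from datetime import datetime

import pyarrow as pa
from airflow.decorators import dag, task
from deltalake import DeltaTable


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def merge_partitions_example():
    @task
    def merge_partition(partition_date: str) -> None:
        # Placeholder source data; in the real pipeline this would be the
        # new/changed rows for this one date partition.
        source = pa.table(
            {"unique_row_hash": ["abc123"], "date": [partition_date], "value": [1]}
        )

        dt = DeltaTable("gs://my-bucket/my-table")  # placeholder table URI
        (
            dt.merge(
                source=source,
                # Restrict the merge to the single partition this task owns.
                predicate=(
                    "t.unique_row_hash = s.unique_row_hash "
                    f"AND t.date = '{partition_date}'"
                ),
                source_alias="s",
                target_alias="t",
            )
            .when_matched_update_all()
            .when_not_matched_insert_all()
            .execute()
        )

    # One mapped task per updated partition (in the real DAG this list comes
    # from the transaction log); running more than one of these at the same
    # time is what triggers the error above.
    merge_partition.expand(partition_date=["2024-01-01", "2024-01-02"])


merge_partitions_example()
```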
What you expected to happen:
Since each partition is unique (for example, each partition is an individual date), the partitions should be able to be written to in parallel. The error message is also misleading: the column it claims is missing does exist. Clearing and re-running a failed task makes it succeed.
How to reproduce it:
Merge into a Delta table using partition_filters while multiple partitions are written to in parallel, as in the sketch below.
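A rough standalone sketch of the parallel pattern described above (two merges against distinct partitions running at the same time). The table URI, column names, and rows are placeholders, the partition restriction is expressed in the merge predicate rather than through the exact partition_filters call used in the real pipeline, and in the real setup the merges run in separate Airflow worker processes rather than threads:

```python
from concurrent.futures import ThreadPoolExecutor

import pyarrow as pa
from deltalake import DeltaTable

TABLE_URI = "gs://my-bucket/my-table"  # placeholder; any date-partitioned table


def merge_one_partition(partition_date: str) -> None:
    # Placeholder rows for this partition.
    source = pa.table(
        {"unique_row_hash": ["abc123"], "date": [partition_date], "value": [1]}
    )
    dt = DeltaTable(TABLE_URI)
    (
        dt.merge(
            source=source,
            predicate=(
                "t.unique_row_hash = s.unique_row_hash "
                f"AND t.date = '{partition_date}'"
            ),
            source_alias="s",
            target_alias="t",
        )
        .when_matched_update_all()
        .when_not_matched_insert_all()
        .execute()
    )


# Two distinct partitions merged concurrently; in our pipeline this is where
# one merge succeeds and the others fail with the schema error above.
with ThreadPoolExecutor(max_workers=2) as pool:
    list(pool.map(merge_one_partition, ["2024-01-01", "2024-01-02"]))
```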
More details: