-
Notifications
You must be signed in to change notification settings - Fork 421
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: handle conflict checking in optimize correctly #2208
Conversation
72b150a
to
72b0090
Compare
Note that this is a small regression in behaviour for long-running optimise calls, as they can now error out after 15 other commits on the delta table have happened. I think there might be a better way to do it where you selectively update the table as long as it won’t conflict with your optimise operation, but IMO the short term payoff of not having corrupted tables on optimise is worth the regression in the short term. |
That line was recently changed in this commit from I don't see a problem with the case as you've laid it out here @emcake , but I would like him to chime in here in case there was a strong reason for that line to exist in the optimize |
@@ -739,18 +740,18 @@ impl MergePlan { | |||
app_metadata.insert("operationMetrics".to_owned(), map); | |||
} | |||
|
|||
table.update().await?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of removing the update completely and forcing users to update their table we can move this update to after the commit is performed. If the commit is successful then we know there are no conflicts.
Head branch was pushed to by a user without write access
To clarify I was thinking of adding the |
No, that wouldn’t be correct behaviour. Consider a table with two data files, A and B, on version 1. imagine we start an optimize with multiple commits, and a merge at the same time.
If you don’t do step (4) then the second optimize commit is compared to version 1, and will detect a conflict against the proposed changes. In general it’s not safe to update in the middle of an optimize operation, because the state of the table is captured at the start of the operation instead of at the point of commit. A safer version of optimize could pre-plan a series of optimize commits (eg one per partition, or n partitions per commit) and then for each partial optimize:
|
I think the planned partial optimize is a potentially useful operation that can be both long-running and safe, but I think it’s outside the scope of this PR (which is about ensuring correctness when running concurrently optimizer with destructive operations) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with your above analysis and that if we want support a progressive optimize an alternative approach is required.
Description
This removes the optimize
update()
before commit behaviour.When digging, I discovered that a z-order after a merge would cause corrupted commits:
https://gist.github.com/emcake/4edfb72d77e08e8a600b8c0c902e2718
This should be prevented, as I'd expect a merge to remove files and for the conflict checker to kick in and prevent to z-order from going through. On digging I found that the conflict checker never came into play because of the call to update() before commit: https://github.com/delta-io/delta-rs/blob/main/crates/core/src/operations/optimize.rs#L738
This should have been caught by tests, but the test for conflict checking was been ignored since it was written: https://github.com/delta-io/delta-rs/blob/main/crates/core/tests/command_optimize.rs#L261
It looks like removing the update passes all tests and allows the conflict checking test to be added back in too. This causes one minor dilemma for long-running optimizes that use the min commit interval parameter - due to the way that commit works, if there is no updating then after there had been 15 intermediate commits it would fail. I've changed it to use
commit_with_retries
and it now accounts for the commits it's made in the retry count.