fix: handle conflict checking in optimize correctly #2208

Merged
merged 8 commits into from
Mar 23, 2024

Conversation

emcake
Contributor

@emcake emcake commented Feb 24, 2024

Description

This removes the optimize update() before commit behaviour.

When digging, I discovered that a z-order after a merge would cause corrupted commits:

https://gist.github.com/emcake/4edfb72d77e08e8a600b8c0c902e2718

This should be prevented, as I'd expect a merge to remove files and for the conflict checker to kick in and prevent the z-order from going through. On digging I found that the conflict checker never came into play because of the call to update() before commit: https://github.com/delta-io/delta-rs/blob/main/crates/core/src/operations/optimize.rs#L738

This should have been caught by tests, but the test for conflict checking has been ignored since it was written: https://github.com/delta-io/delta-rs/blob/main/crates/core/tests/command_optimize.rs#L261

It looks like removing the update passes all tests and allows the conflict checking test to be added back in too. This causes one minor dilemma for long-running optimizes that use the min commit interval parameter: because of the way commit works, without the update it would fail once there had been 15 intermediate commits. I've changed it to use commit_with_retries, and it now accounts for the commits it has made itself in the retry count.
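
To make the retry accounting concrete, here is a minimal sketch of the idea rather than the code in this PR. It assumes a commit planned against `read_version` tries consecutive version numbers and gives up after a fixed number of attempts. `DEFAULT_MAX_ATTEMPTS`, `first_free_version` and `retry_budget` are hypothetical names invented for this illustration; only the figure of 15 comes from the description above.

```rust
/// Hypothetical constant: how many versions a commit tries by default.
const DEFAULT_MAX_ATTEMPTS: usize = 15;

/// Find the version a commit would land on when it was planned against
/// `read_version` but the table has since advanced to `latest_version`.
/// It tries read_version + 1, read_version + 2, ... for at most
/// `max_attempts` candidates and returns None if all of them are taken.
fn first_free_version(read_version: u64, latest_version: u64, max_attempts: usize) -> Option<u64> {
    (1..=max_attempts as u64)
        .map(|i| read_version + i)
        .find(|candidate| *candidate > latest_version)
}

/// Widen the budget by the number of commits this optimize has already made,
/// so its own intermediate commits do not use up the attempts.
fn retry_budget(own_commits: usize) -> usize {
    DEFAULT_MAX_ATTEMPTS + own_commits
}
```

With a snapshot at version 10 and 15 intermediate commits made by the optimize itself, the default budget finds no free version, while `retry_budget(15)` still reaches version 26.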

@github-actions github-actions bot added the binding/rust Issues for the Rust crate label Feb 24, 2024
@ion-elgreco ion-elgreco force-pushed the fix-optimize-conflicts branch from 72b150a to 72b0090 Compare February 24, 2024 13:33
@emcake
Contributor Author

emcake commented Feb 24, 2024

Note that this is a small regression in behaviour for long-running optimise calls, as they can now error out after 15 other commits on the delta table have happened. I think there might be a better way to do it where you selectively update the table as long as it won’t conflict with your optimise operation, but IMO not having corrupted tables on optimise is worth the regression in the short term.

@rtyler rtyler marked this pull request as draft February 24, 2024 18:20
@emcake emcake marked this pull request as ready for review February 25, 2024 10:21
@rtyler rtyler enabled auto-merge (rebase) February 25, 2024 14:41
@rtyler
Member

rtyler commented Feb 25, 2024

That line was recently changed in this commit from update_incremental to update but it was in a monster refactor that @roeap was doing.

I don't see a problem with the case as you've laid it out here @emcake, but I would like him to chime in here in case there was a strong reason for that line to exist in the optimize.

@@ -739,18 +740,18 @@ impl MergePlan {
app_metadata.insert("operationMetrics".to_owned(), map);
}

table.update().await?;
Collaborator

@Blajda Blajda Mar 14, 2024

Instead of removing the update completely and forcing users to update their table we can move this update to after the commit is performed. If the commit is successful then we know there are no conflicts.

auto-merge was automatically disabled March 19, 2024 10:12

Head branch was pushed to by a user without write access

@emcake
Contributor Author

emcake commented Mar 19, 2024

@Blajda I've merged in the CommitBuilder changes and have added an update on the end of the optimize.

@rtyler any further thoughts on merging this? It's not currently safe to use destructive operations (like merge or delete) and optimize on the same table on the same partitions.

@Blajda
Collaborator

Blajda commented Mar 21, 2024

> I've merged in the CommitBuilder changes and have added an update on the end of the optimize.

To clarify, I was thinking of adding the update() call after the call to commit instead of before. Also, when we update, we should only advance to the version returned by the last commit call.

@emcake
Contributor Author

emcake commented Mar 21, 2024

> > I've merged in the CommitBuilder changes and have added an update on the end of the optimize.
>
> To clarify, I was thinking of adding the update() call after the call to commit instead of before. Also, when we update, we should only advance to the version returned by the last commit call.

No, that wouldn’t be correct behaviour. Consider a table with two data files, A and B, on version 1.

Imagine we start an optimize with multiple commits, and a merge at the same time.

  1. The optimize starts, capturing the current state of file A and file B.
  2. The merge finishes, commits version 2. The merge deletes B and adds C.
  3. The first optimize block finishes, and attempts to commit. Let’s say that the optimize only affects file A - it deletes A and inserts D. Conflict checking should check that version 2 (which has appeared) doesn’t conflict with the changes, which is true as v2 only changes B. This makes version 3.
  4. If you update the table at this point, the table state changes to Version 3 containing files C and D.
  5. The second optimize block finishes, and attempts to delete B and add E. Conflict checking doesn’t run as the table is up to date. Version 4 now contains files C, D and E. C and E contain duplicated data from the merge and the optimize.

If you don’t do step (4) then the second optimize commit is compared to version 1, and will detect a conflict against the proposed changes. In general it’s not safe to update in the middle of an optimize operation, because the state of the table is captured at the start of the operation instead of at the point of commit.
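
As a rough illustration of the five steps above, here is a self-contained Rust sketch that replays the scenario on a toy commit log. Everything in it is an assumption made for the example: `Commit`, `conflicts`, `try_commit`, `live_files` and `run_scenario` are hypothetical names, and the conflict rule (a proposed commit conflicts when an unseen commit already removed one of the files it wants to remove) is a simplification, not the delta-rs conflict checker.

```rust
use std::collections::BTreeSet;

/// One committed version, reduced to the files it removes and adds.
struct Commit {
    removed: Vec<&'static str>,
    added: Vec<&'static str>,
}

/// Simplified conflict rule (an assumption for this sketch): a proposed commit
/// conflicts when a commit newer than `read_version` already removed one of
/// the files it wants to remove.
fn conflicts(log: &[Commit], read_version: usize, proposed: &Commit) -> bool {
    // log[i] holds version i + 1, so skipping `read_version` entries leaves
    // only the commits the writer has not seen.
    log.iter()
        .skip(read_version)
        .any(|c| c.removed.iter().any(|f| proposed.removed.contains(f)))
}

/// Append `proposed` if it does not conflict; returns the new version number.
fn try_commit(log: &mut Vec<Commit>, read_version: usize, proposed: Commit) -> Result<usize, String> {
    if conflicts(log.as_slice(), read_version, &proposed) {
        return Err(format!("conflict with a commit after version {read_version}"));
    }
    log.push(proposed);
    Ok(log.len())
}

/// Replay the log to get the set of live files.
fn live_files(log: &[Commit]) -> BTreeSet<&'static str> {
    let mut files = BTreeSet::new();
    for c in log {
        for f in &c.removed {
            files.remove(f);
        }
        for f in &c.added {
            files.insert(*f);
        }
    }
    files
}

/// Replay steps 1 to 5, optionally performing the mid-optimize update (step 4).
fn run_scenario(update_mid_optimize: bool) -> Result<BTreeSet<&'static str>, String> {
    // Version 1 contains files A and B.
    let mut log = vec![Commit { removed: vec![], added: vec!["A", "B"] }];
    // Step 1: the optimize plans against version 1.
    let mut read_version = 1;
    // Step 2: the merge commits version 2, deleting B and adding C.
    log.push(Commit { removed: vec!["B"], added: vec!["C"] });
    // Step 3: the first optimize block rewrites A into D (version 3).
    let v3 = try_commit(&mut log, read_version, Commit { removed: vec!["A"], added: vec!["D"] })?;
    // Step 4: advance the snapshot without re-planning the remaining work.
    if update_mid_optimize {
        read_version = v3;
    }
    // Step 5: the second block, still planned from version 1, rewrites B into E.
    try_commit(&mut log, read_version, Commit { removed: vec!["B"], added: vec!["E"] })?;
    Ok(live_files(&log))
}
```

In this sketch `run_scenario(false)` returns an error at step 5, because the second block is still compared against version 1 and sees the merge's removal of B. `run_scenario(true)` succeeds and leaves C, D and E live, which is the duplicated-data outcome described in step 5.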

A safer version of optimize could pre-plan a series of optimize commits (eg one per partition, or n partitions per commit) and then for each partial optimize:

  1. Gather the data, perform the optimize operation, create the files
  2. Commit just those changes as an atomic commit
  3. If the commit succeeded, mark that potential commit as done. If not, add it back onto the queue of partial optimizes to perform
  4. Update the table to gather the latest state (i.e. before the next gather-data operation)
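
The four steps above could be sketched roughly as follows. This is only an outline under stated assumptions: the `Table` trait, `progressive_optimize` and `ToyTable` are hypothetical and do not correspond to delta-rs types, and the `max_attempts` cap is an addition of this sketch so that a partition which keeps conflicting cannot loop forever.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for a table handle.
trait Table {
    /// Refresh the snapshot to the latest committed state.
    fn update(&mut self);
    /// Gather, rewrite and atomically commit one partition against the
    /// current snapshot. An Err means the commit was rejected.
    fn optimize_partition(&mut self, partition: &str) -> Result<(), String>;
}

/// Work through a pre-planned queue of partial optimizes. Returns the
/// partitions that still failed after `max_attempts` tries each.
fn progressive_optimize<T: Table>(
    table: &mut T,
    partitions: Vec<String>,
    max_attempts: usize,
) -> Vec<String> {
    let mut queue: VecDeque<(String, usize)> =
        partitions.into_iter().map(|p| (p, 0)).collect();
    let mut failed = Vec::new();

    while let Some((partition, attempts)) = queue.pop_front() {
        // Steps 1 and 2: do the work and commit just this partition.
        match table.optimize_partition(&partition) {
            // Step 3: success, so this unit of work is done.
            Ok(()) => {}
            // Step 3: failure, so put it back on the queue for another try.
            Err(_) if attempts + 1 < max_attempts => {
                queue.push_back((partition, attempts + 1));
            }
            Err(_) => failed.push(partition),
        }
        // Step 4: refresh the snapshot before the next gather-data step.
        table.update();
    }
    failed
}

/// Toy table for the example: the first commit touching `conflict_once` fails.
struct ToyTable {
    conflict_once: Option<String>,
    committed: Vec<String>,
    updates: usize,
}

impl Table for ToyTable {
    fn update(&mut self) {
        self.updates += 1;
    }

    fn optimize_partition(&mut self, partition: &str) -> Result<(), String> {
        if self.conflict_once.as_deref() == Some(partition) {
            self.conflict_once = None;
            return Err(format!("conflict on {partition}"));
        }
        self.committed.push(partition.to_owned());
        Ok(())
    }
}
```

The update is safe here only because each unit of work gathers its data after the update, so every commit is checked against the snapshot it was actually planned from.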

@emcake
Contributor Author

emcake commented Mar 21, 2024

I think the planned partial optimize is a potentially useful operation that can be both long-running and safe, but I think it’s outside the scope of this PR (which is about ensuring correctness when running optimize concurrently with destructive operations).

Collaborator

@Blajda Blajda left a comment

I agree with your above analysis and that if we want to support a progressive optimize, an alternative approach is required.

@Blajda Blajda enabled auto-merge (squash) March 23, 2024 02:22
@Blajda Blajda merged commit f49eedb into delta-io:main Mar 23, 2024
19 of 20 checks passed