Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Factor vacuum and implement a builder #672

Merged
merged 3 commits into from
Jul 7, 2022

Conversation

Blajda
Copy link
Collaborator

@Blajda Blajda commented Jul 4, 2022

Description

Factor out vacuum errors and operation into it's own file and use a builder with reasonable defaults. No changes in functionality except the user can provide a Duration when specifying the retention period. This brings us in line with the reference implementation which allows for time periods less than a hour.

This is a breakdown of #669 to only include the builder functionality.

@wjones127 wjones127 self-requested a review July 5, 2022 04:40
@roeap roeap self-requested a review July 5, 2022 18:08
wjones127
wjones127 previously approved these changes Jul 6, 2022
Copy link
Collaborator

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only had one minor suggestion. Otherwise this looks great!

Comment on lines +1187 to +1189
let mut plan = Vacuum::default()
.dry_run(dry_run)
.enforce_retention_duration(enforce_retention_duration);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like that API! 😄

@@ -0,0 +1,217 @@
//! Vacuum a Delta table
//!
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm we should probably turn wrap_comments = true (in a separate PR).

rust/src/vacuum.rs Show resolved Hide resolved
/// See this module's documentation for more information
pub struct Vacuum {
/// Period of stale files allowed.
pub retention_period: Option<Duration>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for making this a duration 🙌

Copy link
Collaborator

@roeap roeap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Blajda - excellent work! I left some comments for discussion, but am not sure how deep those changes would run.

When proposing #661 I was hoping that we can move away from passing the entire table around to operations, but rather a snapshot of the table (DeltaTableState) and the object store, which may eventually be encapsulated in the mentioned dedicated DeltaLog. I believe that this would open us up to a bunch more clean-up and optimizations down the road, and would also simplify the implementation of #632. This would also much more mirror the designs found the in reference implementation.

My main concern with using just the state is, that we really don't want to clone that as it can be very expensive (maybe wrap in Arc?)

I haven't had the time to think about the implications this properly but maybe this PR is an opportunity to look into some of this?

Of course very curious what other think about this - @houqp @wjones127?


/// Errors that can occur during vacuum
#[derive(thiserror::Error, Debug)]
pub enum VacuumError {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really introduced in this PR, but I am getting a little concerned about the number of different error enums we have in this crate. I think eventually we should think about consolidation (maybe one enum for operations) or start to introduce something like how std::io::Error is modelled unsing a generic error which has an ErroKind...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah creating appropriate error types is harder than the borrow checker at times 😆

Comment on lines +209 to +222
return Ok(VacuumMetrics {
files_deleted: plan.files_to_delete,
dry_run: true,
});
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

excellent - returning metrics like this is exactly the right thing to do. In #673 this is what's needed to do proper conflict resolution. There we need to know all files read by a transaction. There is a CurrentTransactionInfo struct in that PR that holds all we want to know about the current transaction. I see now that this does not care about files removed from the table. but while the complete transaction info object is maybe not suitable here, maybe we want a more generic Metrics struct, that can also report on other metrics?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if a generic Metrics struct would really work in Rust but I think it would be a good place for traits. For example, tracking the time an operation took is common so a TimeCostMetric trait could be used to obtain that. Ditto for conflict resolution. A trait that specifies all the information required can be defined and then each operation is required to implement it.

Ok(_) => Ok(files_to_delete),
Err(err) => Err(DeltaTableError::StorageError { source: err }),
}
let res = plan.execute(self).await?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of my hopes in proposing #661 was that we could eventually return the builder and arrive at an API like.

let dt = open_table(...);
let result = dt.vacuum().enforce_retention_duration(true).into_future().await;

once the IntoFuture trait lands, this would simplify to.

let dt = open_table(...);
let result = dt.vacuum().enforce_retention_duration(true).await;

This would also mean we don't have to have vacuum itself be async anymore, but requires that the VacuumPlan has everything it needs to execute. I left some thoughts in to top level comment.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah if DeltaLog was implemented the into_future() change would be trivial. I'll play on a nightly branch and see how hard it would be without the log.

@Blajda
Copy link
Collaborator Author

Blajda commented Jul 6, 2022

When proposing #661 I was hoping that we can move away from passing the entire table around to operations, but rather a snapshot of the table (DeltaTableState) and the object store, which may eventually be encapsulated in the mentioned dedicated DeltaLog

Yes I fully agree with this. For this operation only the delta log at a particular version is required and since it's a snapshot isolate we don't have to worry about conflicts.

When it comes to avoiding clones, I don't think just having the state in Arc would solve it since you would need to clone the state to put it into an Arc which then prevent writes. Naively you could use a linked list where each node points to the previous version and then pass around the 'head' of the list. Then operations can write to the log without impacting the view of others and reconcile/fail when they read the ground truth. The primary challenge here would be with Rust's ownership model but I feel like there's a functional data struct that would solve and it would definitely require some reference counting. I might be completely off here >.<

@Blajda
Copy link
Collaborator Author

Blajda commented Jul 7, 2022

Hi @wjones127 and @roeap
Appreciate your comments and insights that you shared in this PR. I have updated the module description with a bit on time travel, added an example, and resolved the merge conflict.

@Blajda Blajda requested a review from roeap July 7, 2022 01:03
@wjones127 wjones127 merged commit d10b428 into delta-io:main Jul 7, 2022
@wjones127
Copy link
Collaborator

Thanks @Blajda!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants