-
Notifications
You must be signed in to change notification settings - Fork 421
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Factor vacuum and implement a builder #672
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only had one minor suggestion. Otherwise this looks great!
let mut plan = Vacuum::default() | ||
.dry_run(dry_run) | ||
.enforce_retention_duration(enforce_retention_duration); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like that API! 😄
@@ -0,0 +1,217 @@ | |||
//! Vacuum a Delta table | |||
//! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm we should probably turn wrap_comments = true
(in a separate PR).
/// See this module's documentation for more information | ||
pub struct Vacuum { | ||
/// Period of stale files allowed. | ||
pub retention_period: Option<Duration>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for making this a duration 🙌
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Blajda - excellent work! I left some comments for discussion, but am not sure how deep those changes would run.
When proposing #661 I was hoping that we can move away from passing the entire table around to operations, but rather a snapshot of the table (DeltaTableState
) and the object store, which may eventually be encapsulated in the mentioned dedicated DeltaLog
. I believe that this would open us up to a bunch more clean-up and optimizations down the road, and would also simplify the implementation of #632. This would also much more mirror the designs found the in reference implementation.
My main concern with using just the state is, that we really don't want to clone that as it can be very expensive (maybe wrap in Arc
?)
I haven't had the time to think about the implications this properly but maybe this PR is an opportunity to look into some of this?
Of course very curious what other think about this - @houqp @wjones127?
|
||
/// Errors that can occur during vacuum | ||
#[derive(thiserror::Error, Debug)] | ||
pub enum VacuumError { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really introduced in this PR, but I am getting a little concerned about the number of different error enums we have in this crate. I think eventually we should think about consolidation (maybe one enum for operations) or start to introduce something like how std::io::Error
is modelled unsing a generic error which has an ErroKind
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah creating appropriate error types is harder than the borrow checker at times 😆
return Ok(VacuumMetrics { | ||
files_deleted: plan.files_to_delete, | ||
dry_run: true, | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
excellent - returning metrics like this is exactly the right thing to do. In #673 this is what's needed to do proper conflict resolution. There we need to know all files read by a transaction. There is a CurrentTransactionInfo
struct in that PR that holds all we want to know about the current transaction. I see now that this does not care about files removed from the table. but while the complete transaction info object is maybe not suitable here, maybe we want a more generic Metrics struct, that can also report on other metrics?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if a generic Metrics struct would really work in Rust but I think it would be a good place for traits. For example, tracking the time an operation took is common so a TimeCostMetric
trait could be used to obtain that. Ditto for conflict resolution. A trait that specifies all the information required can be defined and then each operation is required to implement it.
Ok(_) => Ok(files_to_delete), | ||
Err(err) => Err(DeltaTableError::StorageError { source: err }), | ||
} | ||
let res = plan.execute(self).await?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of my hopes in proposing #661 was that we could eventually return the builder and arrive at an API like.
let dt = open_table(...);
let result = dt.vacuum().enforce_retention_duration(true).into_future().await;
once the IntoFuture
trait lands, this would simplify to.
let dt = open_table(...);
let result = dt.vacuum().enforce_retention_duration(true).await;
This would also mean we don't have to have vacuum itself be async anymore, but requires that the VacuumPlan
has everything it needs to execute. I left some thoughts in to top level comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah if DeltaLog
was implemented the into_future()
change would be trivial. I'll play on a nightly branch and see how hard it would be without the log.
Yes I fully agree with this. For this operation only the delta log at a particular version is required and since it's a snapshot isolate we don't have to worry about conflicts. When it comes to avoiding clones, I don't think just having the state in Arc would solve it since you would need to clone the state to put it into an Arc which then prevent writes. Naively you could use a linked list where each node points to the previous version and then pass around the 'head' of the list. Then operations can write to the log without impacting the view of others and reconcile/fail when they read the ground truth. The primary challenge here would be with Rust's ownership model but I feel like there's a functional data struct that would solve and it would definitely require some reference counting. I might be completely off here >.< |
Hi @wjones127 and @roeap |
Thanks @Blajda! |
Description
Factor out vacuum errors and operation into it's own file and use a builder with reasonable defaults. No changes in functionality except the user can provide a
Duration
when specifying the retention period. This brings us in line with the reference implementation which allows for time periods less than a hour.This is a breakdown of #669 to only include the builder functionality.