-
Notifications
You must be signed in to change notification settings - Fork 421
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement filesystem check #1103
Conversation
Instead of heading each individual file, wouldn't it be more efficient to issue an prefixed list call to gather all current files in the table? This way, we can reduce API call count by about 1000x. |
@houqp I modified the process to instead use a list call but I'm not quite sure on how I would take advantage of prefixes in this case. For partitioned tables it could perform a list per partition path but I think non-partitioned tables would still require a full list on the root. For non-partitioned tables would it be possible to perform a list without including the files in |
@Blajda the prefix filter can only be applied with the table root unfortunately, so it will always include the _detal_log folder unfortunately. The main benefit it provides is we can restrict the listing to the table we are running the operation on, not all other tables stored in the same bucket. |
@roeap @wjones127 want to take a final look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for adding this! I've been thinking having an operation like this would let me sleep better at night :)
I think we should address the case where data files are outside the root, either explicitly not supporting or using HEAD for those ones. Also noticed a few minor issues.
|
||
impl FileSystemCheckBuilder { | ||
/// Create a new [`FileSystemCheckBuilder`] | ||
pub fn new(store: Arc<DeltaObjectStore>, state: DeltaTableState) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we validate that the provided state is the latest version of the table? Since I don't think this is a valid operation for earlier states? or maybe that is an overall issue with the operations module?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think dry run of this operation would be helpful for diagnosing issues that arise from a combination of vacuum + time travel. If executed on a older version I do expect it to fail during the commit operation.
I can add a test to demonstrate that.
files_to_check.insert(active.path.to_owned(), active); | ||
}); | ||
|
||
let mut files = self.store.list(None).await?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Delta Tables are allowed to have data files outside the table root. Currently, if that is the case, this operation will remove them all from the log. I see two options:
- Call HEAD on each file outside the root to check if they exist.
- Error if we detect any files outside the root.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Do any examples of tables that use absolute paths?
I just checked the protocol and noticed absolute paths are encoded as defined in rfc2396
It states the following
An absolute URI contains the name of the scheme being used (<scheme>)
followed by a colon (":") and then a string (the <scheme-specific-
part>) whose interpretation depends on the scheme.
So I can segment paths based of it the scheme is set on the path and then use the correct api call for each.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh that's a good point! Checking for the scheme at the beginning makes sense.
We don't have any example tables right now. We don't generally support them in most operations, but we should try to keep them in mind. Hard to have example tables without setting up a public S3 bucket or something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not comfortable with supporting absolute paths without having some integration tests. Currently the operation will fail when an absolute path is encountered.
Add wjones127 suggestions Co-authored-by: Will Jones <[email protected]>
@wjones127 I updated the PR to fail when an absolute path exists in the Delta log. Also added an additional test to demonstrate the operation will only work on the latest table version. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Blajda - one small question along the lines of hoping to eventually support absolute paths in the delta log. Otherwise looks great!
fn is_absolute_path(path: &str) -> DeltaResult<bool> { | ||
match Url::parse(path) { | ||
Ok(_) => Ok(true), | ||
Err(ParseError::RelativeUrlWithoutBase) => Ok(false), | ||
Err(_) => Err(DeltaTableError::Generic(format!( | ||
"Unable to parse path: {}", | ||
&path | ||
))), | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will not work for local paths. One pattern we apply in object_store
and here is to try and cannonicalize the path and if it fails proceed with url parsing. e.g.
Lines 355 to 358 in a60129b
if let Ok(path) = std::fs::canonicalize(table_uri) { | |
return Url::from_directory_path(path) | |
.map_err(|_| DeltaTableError::InvalidTableLocation(table_uri.to_string())); | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't follow on why it would not work on local paths. Paths in the Delta log must follow rfc2396 which states absolute paths must have a scheme.
Do you have a example that would cause the unit test for this function to fail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well .. in that case I am wrong! If that paths are required to have a scheme, then url parsing should always work for valid paths ... The windows case is also covered in your tests, so all is well :).
# Description Implementation of the filesystem check operation. The implementation is fairly straight forward with a HEAD call being made for each active file to check if it exists. A remove action is then made for each file that is orphaned. An alternative solution is instead to maintain a hashset with all active files and then recursively list all files. If the file exists then remove from the set. All remaining files in the set are then considered orphaned. Looking for feedback and if the second approach is preferred I can make the changes # Related Issue(s) - closes delta-io#1092 --------- Co-authored-by: Will Jones <[email protected]>
Description
Implementation of the filesystem check operation.
The implementation is fairly straight forward with a HEAD call being made for each active file to check if it exists.
A remove action is then made for each file that is orphaned.
An alternative solution is instead to maintain a hashset with all active files and then recursively list all files. If the file exists then remove from the set. All remaining files in the set are then considered orphaned.
Looking for feedback and if the second approach is preferred I can make the changes
Related Issue(s)