-
Notifications
You must be signed in to change notification settings - Fork 864
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a custom implementation LocalFileSystem::list_with_offset
#7019
Conversation
Co-authored-by: Adam Reeve <[email protected]>
The initial work for this PR was done by @dplasto, who reported the issue and provided an initial implementation. Further suggestions were provided by @adamreeve . |
It would appear this is legitimately failing CI |
Thanks @corwinjoy and @tustvold -- marking this as a draft as we work through the test failures |
I'm not sure what is happening with the CI. It seems to be failing
formatting. But, I ran 'cargo format' and clippy as specified in
Contributing.md so I am not sure what is wrong.
…On Sat, Jan 25, 2025, 9:14 AM Andrew Lamb ***@***.***> wrote:
Thanks @corwinjoy <https://github.com/corwinjoy> and @tustvold
<https://github.com/tustvold> -- marking this as a draft as we work
through the test failures
—
Reply to this email directly, view it on GitHub
<#7019 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ADBF42SFNON4OYCDVEG36AT2MPA6RAVCNFSM6AAAAABV2WCNBSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMMJUGAZTINRXGY>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Are you running from the arrow-rs directory or the object_store directory. When I run |
Ah, thanks for the clarification @etseidl! From the guidelines I didn't realize that I had to run these commands in the object-store rather than the arrow-rs directory. Doing that, I do get suggested changes which I have applied. Hopefully this will clear the CI checks. |
The emulator tests appear to be failing: https://github.com/apache/arrow-rs/actions/runs/12970828706/job/36180799417?pr=7019
|
…or testing rather than pointing to existing dir.
OK. I have changed this test to use a temporary test and files which should make it more portable. We will see what the emulator says about this. |
Looks like the tests are good |
This PR looks ready for review to me. @corwinjoy shall we mark it as ready for review? |
@alamb Yes please. Mark it as ready for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @corwinjoy -- I think this code is better than what is on main.
I am somewhat concerned that the listing with offset is going to result in non deterministic results but I don't think this PR makes that any better/worse than what is on main
Thanks again
None => config.root.to_file_path().unwrap(), | ||
}; | ||
|
||
let walkdir = WalkDir::new(root_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did some digging and I don't think WalkDir guarantees anything about the sort order of the directory entries
thus if the OS returns a different order on the next call to list_with_offset
this may end up with a different result
However I think the existing default impl of list_with_offset
(which just calls list() and takes the offset) has the same problem so I don't think it needs to be fixed in this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does look to me like you can explicitly set an order via https://docs.rs/walkdir/latest/walkdir/struct.WalkDir.html#method.sort_by
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb I agree with all this and thanks for the helpful comments! I think your concern about not having a deterministic order is a good one and agree this is the existing behavior. I also think we should change this, but propose we do that in a follow-up PR since it does change the behavior and could (potentially) break an assumption somewhere. I think we can just use https://docs.rs/walkdir/latest/walkdir/struct.WalkDir.html#method.sort_by_file_name
which should be fine and I believe matches what the offset filter expects.
.map(|x| String::from(x.filename().unwrap())) | ||
.collect(); | ||
|
||
// Check result with direct filesystem read |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this depends on the file system returning the same order each time. Again this is probably ok but it seems like it could result in a non deterministic test failure over time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, it does not depend on this because later, in line 1485 we sort both result sets before comparing. So the initial sort order that gets returned does not matter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking that if the underlying directory list was different, then the values after skipping the first 10 (for example) might be different, even if we sorted the values that came out.
But since this doesn't seem to happen in practice I am not going to worry too much about it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've not had time to review this in detail but a couple of points that may or may not be relevant:
- We do not guarantee the order of list results - https://docs.rs/object_store/latest/object_store/trait.ObjectStore.html#tymethod.list
- Filesystems on many systems do not yield a consistent ordering - it is normally stable in the absence of mutation, but this isn't a strict guarantee
Or to phrase it either way, we shouldn't need to sort, nor should we rely on data being sorted
For context on why we don't guarantee any ordering, it is because many stores, e.g. S3 Express, don't
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense -- thank you
To be clear I don't think this PR changes anything about the object_store operation in the face of inconsistent ordering between calls (but I was speculating on what would happen if the ordering returned by the underlying systems was different between calls)
I'll plan to merge this PR in a day or two unless there are any more comments or anyone would like longer to review |
Thanks again @corwinjoy |
Which issue does this PR close?
Closes #7018.
Rationale for this change
Performance enhancement needed for network file system as described in the PR.
What changes are included in this PR?
Add an implementation
LocalFileSystem::list_with_offset
so that it can intelligently scan files and drastically reduce the number of statx system calls. Basically, we add a prefilter to pull only what is needed in-place.Are there any user-facing changes?
None.