Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add optimization to only list log files starting at a certain name #1252

Closed
wjones127 opened this issue Mar 28, 2023 · 17 comments · Fixed by #1410
Closed

Add optimization to only list log files starting at a certain name #1252

wjones127 opened this issue Mar 28, 2023 · 17 comments · Fixed by #1410
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@wjones127
Copy link
Collaborator

Description

When listing the _delta_log directory, we should be able to say "start listing at 000...0010000.json` to only get log files after the 10000th (skipping a lot of calls) if that's all we need.

Use Case

Related Issue(s)

@wjones127 wjones127 added the enhancement New feature or request label Mar 28, 2023
@wjones127
Copy link
Collaborator Author

apache/arrow-rs#3970

@rtyler
Copy link
Member

rtyler commented Mar 28, 2023

Good idea!

@wjones127
Copy link
Collaborator Author

Heard from Scott that this was recently implemented in the Spark impl. Definitely seems like low-hanging fruit.

@wjones127 wjones127 added the good first issue Good for newcomers label Mar 28, 2023
@ognis1205
Copy link
Contributor

Would you mind if I worked on this? While I would love to tackle some of the "aged" good first issues, I haven't been able to determine which ones are still active and require assistance. Therefore, I would like to begin by addressing a low-hanging fruit before diving into more extensive contributions.

@wjones127
Copy link
Collaborator Author

Feel free to! I'm not 100% sure this optimization actually applies to our codebase yet. It looks like instead of listing, we just look for the N+1 log file until we get a 404 error back:

delta-rs/rust/src/delta.rs

Lines 792 to 793 in 8a4b2b8

let commit_uri = commit_uri_from_version(next_version);
let commit_log_bytes = self.storage.get(&commit_uri).await;

But I think there's at least one function that could benefit now:

async fn find_latest_check_point_for_version(

@ognis1205
Copy link
Contributor

@wjones127
Thanks as always, Will!
I will look into the implementation and file a PR!

@ognis1205
Copy link
Contributor

FYI:
I still have not gotten started on this. There is another task on delta-rs.
Please forgive my lateness. I will start from this week.

@roeap
Copy link
Collaborator

roeap commented May 22, 2023

@ognis1205 - no worries at all. I do believe that for this optimization to really take effect, it needs to be added in the object_store crate rather then here, otherwise we would be getting the full payload in any case.

After a quick scan, it seems S3 supports this, although I haven't seen any explicit mention of ordering in the response, there is a "start-after" header that can be passed to list. I guess If there is a concept of "after" there should be a concept of ordering as well :).

For Azure I unfortunately have not found an equivalent parameter.

@ognis1205
Copy link
Contributor

@roeap
Thank you for understanding and sharing your helpful comments! It really helps me!

@ognis1205
Copy link
Contributor

ognis1205 commented May 25, 2023

FYI:

  • AWS

https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html#API_ListObjectsV2_RequestParameters

Objects are returned sorted in an ascending order of the respective key names in the list

https://docs.aws.amazon.com/AmazonS3/latest/userguide/ListingKeysUsingAPIs.html

List results are always returned in UTF-8 binary order

https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html#API_ListObjectsV2_Example_3
Response is exclusive

  • GCP

https://cloud.google.com/storage/docs/json_api/v1/objects/list

Filter results to objects whose names are lexicographically before endOffset. If startOffset is also set, the objects listed have names between startOffset (inclusive) and endOffset (exclusive)

  • Azure

https://learn.microsoft.com/en-us/rest/api/storageservices/list-blobs?tabs=azure-ad
Does not support

  • object_store 0.5.6

https://github.com/apache/arrow-rs/blob/master/object_store/CHANGELOG-old.md
Support list_with_offset feature for AWS

https://docs.rs/object_store/latest/object_store/trait.ObjectStore.html#tymethod.list

Note: the order of returned ObjectMeta is not guaranteed

@ognis1205
Copy link
Contributor

ognis1205 commented May 27, 2023

  • Add list_with_offset

impl ObjectStore for DeltaObjectStore {

  • Try to use list_with_offset instead

while let PeekCommit::New(new_version, actions) =

  • Try to use list_with_offset instead

let mut stream = self.storage.list(Some(self.storage.log_path())).await?;

In this part, ObjectStore::list is used to find the latest checkpoint before the specified version so just rename version to on_or_before instead.

@ognis1205
Copy link
Contributor

I think I found a bug relating to this issue:
#1403

@ognis1205
Copy link
Contributor

@wjones127 @roeap

Hi folks,

Sorry for being late; I have been a bit busy lately. Regarding the issue, after reviewing the current implementation of delta-rs and object_store, I noticed something:

  • The ObjectStore::list and ObjectStore::list_with_offset methods return ObjectMeta, which does not provide any access to a file system. In other words, we have to call the ObjectStore::get method to search for each commit file until we receive a 404 error.

  • As for the find_latest_checkpoint_for_version method, the specified version number is used to query all commit logs before it. Therefore, I believe the proposed optimization cannot be applied here.

Instead, while reviewing the current implementation, I guess I found a bug relating to incremental_update. I will work on it if would not mind.

@wjones127
Copy link
Collaborator Author

We don't use ObjectStore::list_with_offset yet, but I think we will need it to support reading multipart checkpoint files. delta-io/delta#946

@ognis1205
Copy link
Contributor

@wjones127
I understand. Please allow me make it clear again:

  • Add list_with_offset delegating method to DeltaObjectStore
  • No optimization added
  • Regarding the issue what I suspect a bug, I will make another PR + reproducing code

Would you mind if I proceed this way?

@wjones127
Copy link
Collaborator Author

Yes, that sounds good

@ognis1205
Copy link
Contributor

@wjones127
I filed an PR:
#1410

wjones127 pushed a commit that referenced this issue May 30, 2023
# Description
Adds the `list_with_offset` delegation method to `DeltaObjectStore`.

# Related Issue(s)
- closes #1252 

# Documentation

apache/arrow-rs#3970

Signed-off-by: Shingo OKAWA <[email protected]>
roeap pushed a commit to roeap/delta-rs that referenced this issue Jun 2, 2023
# Description
Adds the `list_with_offset` delegation method to `DeltaObjectStore`.

# Related Issue(s)
- closes delta-io#1252 

# Documentation

apache/arrow-rs#3970

Signed-off-by: Shingo OKAWA <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants