-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Question] Usage of async object store APIs in consuming code #1313
Comments
Hi @roeap , the current usages of sync version of object store API in DataFusion are because Parquet/CSV/JSON/Avro readers/writers in arrow-rs are all constructed based on We could certainly move to use the async version of object store when the file format reader/writers are re-implemented using async API. Please refer to apache/arrow-rs#111 if you are interested in making these readers/writers async first. |
Thanks @yjshen, thanks for your answer and certainly I will try to have a look at arrow-rs and see if I can contribute there. From what I understand in arrow2 a lot of work has already been done and should serve as a nice basis. To that extend I saw that there has been some work done already migrating to arrow2 in this project. It this still being worked on, and if so, does it make sense to contribute to that migration? |
@roeap Yes, we have the migration experiment to arrow2 in #68. Currently, the PR is slightly lagging from the master branch, especially for the recently added "ListingTable" that uses the object store API. If you want to start from there, the first thing might be to catch up with the master branch. Then arrow2 already has an async Parquet reader that we could use for async object-store implementation. Another important note from my side: Arrow2 is not an ASF project yet. The best we can achieve for the moment is to maintain datafusion-arrow2 as a separate branch in this repo, and regularly execute chores to synchronize it with the master branch. If this development model suits you, you are very welcome to join us on this arrow2 work. |
I guess given the above, I would start by trying to provide async read/write in arrow-rs inspired by the work in arrow2. Subsequently I'd be happy to contribute to maintaining an arrow2 branch until it becomes an official asf project. Are there currently any plans to create a branch in this repo for arrow2, or would contributions land on the current external branch? |
Yes, we do have plan to maintain an arrow2 branch until @jorgecarleitao thinks arrow2 is ready, see: #68 (comment). I think we should be ready to merge #68 into a new arrow2 branch after another round of merge with the current master. I am going to switch back to the arrow2 port this weekend now that 6.0.0 release is out. I will ping you in that PR when it's ready for collaboration. |
@houqp @yjshen - it's been a while and since then a lot has happened :). Specifically we have async support now in The trait in here uses In contrast to the sync api, a method for a full reader does not exist (yet?) for the async case. My thinking right now to approach this would be to switch to using the tokio traits to be consistent with the parquet reader (an alternative to use one of the compat methods) and add a reader method somewhat like this .. async fn reader(&self) -> Result<Box<dyn AsyncRead + AsyncSeek + Unpin>>; and then try applying Is this on the right track? |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
From what I can gather from the codebase, it seems the synchronous APIs are exclusively used when using the object store. At least in my experience, the SDKs for remote object stores have asynchronous methods as primary APIs. Furthermore invoking a synchronous method inside async is straight forward by spawning a blocking task. On the other hand I could not figure out a way to call async code inside a sync method when already inside a tokio runtime (i.e. cannot call block_on ...)
Describe the solution you'd like
Use the async object store APIs by default, to facilitate implementation of remote object stores.
Describe alternatives you've considered
Alternatively a flag on the object store might be used to check if sync or async calls should be used. Or maybe I missed an obvious solution :)
Additional context
Add any other context or screenshots about the feature request here.
The text was updated successfully, but these errors were encountered: