Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Question] Usage of async object store APIs in consuming code #1313

Closed
roeap opened this issue Nov 15, 2021 · 6 comments
Closed

[Question] Usage of async object store APIs in consuming code #1313

roeap opened this issue Nov 15, 2021 · 6 comments
Labels
enhancement New feature or request

Comments

@roeap
Copy link
Contributor

roeap commented Nov 15, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

From what I can gather from the codebase, it seems the synchronous APIs are exclusively used when using the object store. At least in my experience, the SDKs for remote object stores have asynchronous methods as primary APIs. Furthermore invoking a synchronous method inside async is straight forward by spawning a blocking task. On the other hand I could not figure out a way to call async code inside a sync method when already inside a tokio runtime (i.e. cannot call block_on ...)

Describe the solution you'd like

Use the async object store APIs by default, to facilitate implementation of remote object stores.

Describe alternatives you've considered

Alternatively a flag on the object store might be used to check if sync or async calls should be used. Or maybe I missed an obvious solution :)

Additional context
Add any other context or screenshots about the feature request here.

@roeap roeap added the enhancement New feature or request label Nov 15, 2021
@yjshen
Copy link
Member

yjshen commented Nov 16, 2021

Hi @roeap , the current usages of sync version of object store API in DataFusion are because Parquet/CSV/JSON/Avro readers/writers in arrow-rs are all constructed based on std::io::{Read, Seek, SeekFrom, Write...}, which are all sync trait.

We could certainly move to use the async version of object store when the file format reader/writers are re-implemented using async API. Please refer to apache/arrow-rs#111 if you are interested in making these readers/writers async first.

@roeap
Copy link
Contributor Author

roeap commented Nov 16, 2021

Thanks @yjshen, thanks for your answer and certainly I will try to have a look at arrow-rs and see if I can contribute there. From what I understand in arrow2 a lot of work has already been done and should serve as a nice basis. To that extend I saw that there has been some work done already migrating to arrow2 in this project. It this still being worked on, and if so, does it make sense to contribute to that migration?

@yjshen
Copy link
Member

yjshen commented Nov 16, 2021

@roeap Yes, we have the migration experiment to arrow2 in #68. Currently, the PR is slightly lagging from the master branch, especially for the recently added "ListingTable" that uses the object store API. If you want to start from there, the first thing might be to catch up with the master branch. Then arrow2 already has an async Parquet reader that we could use for async object-store implementation.

Another important note from my side: Arrow2 is not an ASF project yet. The best we can achieve for the moment is to maintain datafusion-arrow2 as a separate branch in this repo, and regularly execute chores to synchronize it with the master branch. If this development model suits you, you are very welcome to join us on this arrow2 work.

@roeap
Copy link
Contributor Author

roeap commented Nov 16, 2021

I guess given the above, I would start by trying to provide async read/write in arrow-rs inspired by the work in arrow2. Subsequently I'd be happy to contribute to maintaining an arrow2 branch until it becomes an official asf project. Are there currently any plans to create a branch in this repo for arrow2, or would contributions land on the current external branch?

@houqp
Copy link
Member

houqp commented Nov 18, 2021

Are there currently any plans to create a branch in this repo for arrow2, or would contributions land on the current external branch?

Yes, we do have plan to maintain an arrow2 branch until @jorgecarleitao thinks arrow2 is ready, see: #68 (comment).

I think we should be ready to merge #68 into a new arrow2 branch after another round of merge with the current master. I am going to switch back to the arrow2 port this weekend now that 6.0.0 release is out. I will ping you in that PR when it's ready for collaboration.

@roeap
Copy link
Contributor Author

roeap commented Apr 27, 2022

@houqp @yjshen - it's been a while and since then a lot has happened :). Specifically we have async support now in arros-rs as well and I was hoping to see if I could help implementing async support for the ObjectReader in this end as well. Looking into what that might entail I stubled across a few questions and I was hoping you could provide some guidance if I understood things correctly.

The trait in here uses AsyncRead from futures, while the parquet implementation uses the tokio traits. Seeing that tokio already is a non optional dependency in data-access, would it be Ok to change that API?

In contrast to the sync api, a method for a full reader does not exist (yet?) for the async case. My thinking right now to approach this would be to switch to using the tokio traits to be consistent with the parquet reader (an alternative to use one of the compat methods) and add a reader method somewhat like this ..

async fn reader(&self) -> Result<Box<dyn AsyncRead + AsyncSeek + Unpin>>;

and then try applying ParquetRecordBatchStream to simplify ParquetExec. (maybe in a second PR)

Is this on the right track?

@roeap roeap closed this as completed Sep 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants