ObjectStore write support #2185

Closed
wjones127 opened this issue Apr 9, 2022 · 7 comments
Labels
enhancement New feature or request

Comments

@wjones127
Member

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

We are looking at improving the filesystem / object store support in delta-rs, but it seems better to do that work inside datafusion's data-access crate than to do it all in delta-rs. delta-rs currently has file system support for local fs, GCS, S3, and ADLS, but only for reading and writing whole files. I think we'll want to add streaming reads and writes.

Describe the solution you'd like

Design and implement a streaming write interface into the ObjectStore trait.
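
As a rough illustration of what such a streaming write interface could look like, here is a minimal sketch using only the standard library. Every name in it (`StreamingWriteStore`, `ObjectWriter`, `put_stream`, `finish`, `InMemoryStore`) is hypothetical and chosen for illustration; it is not the datafusion `ObjectStore` trait, and a real design would most likely be async. The idea shown is that the caller writes in chunks and the object only becomes visible once the writer is explicitly finished.

```rust
use std::collections::HashMap;
use std::io::{self, Write};
use std::sync::{Arc, Mutex};

/// Hypothetical streaming-write interface (not the real ObjectStore trait).
pub trait StreamingWriteStore {
    /// Open a writer for `path`; nothing is visible to readers until `finish`.
    fn put_stream(&self, path: &str) -> io::Result<Box<dyn ObjectWriter>>;
    /// Read a whole object back.
    fn get(&self, path: &str) -> io::Result<Vec<u8>>;
}

/// A writer that accepts chunks and must be committed explicitly.
pub trait ObjectWriter: Write {
    /// Publish everything written so far as a single object.
    fn finish(self: Box<Self>) -> io::Result<()>;
}

/// Toy backend that keeps objects in a shared map (assumption: for demos only).
#[derive(Clone, Default)]
pub struct InMemoryStore {
    objects: Arc<Mutex<HashMap<String, Vec<u8>>>>,
}

struct InMemoryWriter {
    path: String,
    buffer: Vec<u8>,
    objects: Arc<Mutex<HashMap<String, Vec<u8>>>>,
}

impl Write for InMemoryWriter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        // Chunks are staged locally; readers cannot see them yet.
        self.buffer.extend_from_slice(buf);
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

impl ObjectWriter for InMemoryWriter {
    fn finish(self: Box<Self>) -> io::Result<()> {
        let InMemoryWriter { path, buffer, objects } = *self;
        // The staged bytes become visible in one step.
        objects.lock().unwrap().insert(path, buffer);
        Ok(())
    }
}

impl StreamingWriteStore for InMemoryStore {
    fn put_stream(&self, path: &str) -> io::Result<Box<dyn ObjectWriter>> {
        Ok(Box::new(InMemoryWriter {
            path: path.to_string(),
            buffer: Vec::new(),
            objects: Arc::clone(&self.objects),
        }))
    }

    fn get(&self, path: &str) -> io::Result<Vec<u8>> {
        let objects = self.objects.lock().unwrap();
        objects
            .get(path)
            .cloned()
            .ok_or_else(|| io::Error::new(io::ErrorKind::NotFound, path.to_string()))
    }
}
```

A caller would obtain a writer with `put_stream`, call `write_all` as many times as needed, and then call `finish`. Dropping the writer without finishing leaves the store unchanged in this sketch.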

Describe alternatives you've considered

We could do that work in delta-rs and then contribute it back here later. But it might not transfer well. For example, the current delta-rs S3 filesystem uses rusoto, while the datafusion object store uses the AWS SDK.

wjones127 added the enhancement label Apr 9, 2022
@xudong963
Member

xudong963 commented Apr 9, 2022

related to #2025. cc @matthewmturner

@Cheappie
Contributor

Cheappie commented Apr 9, 2022

I wonder what quality of write support you plan to provide? A production-ready implementation of data ingestion can be as large an effort as creating another project like Apache Kafka.

@matthewmturner
Contributor

Thanks, @xudong963 and @wjones127. Very happy to see this. It also relates to #1777.

@wjones127
Member Author

I wonder what quality of write support you plan to provide?

Basically, I would like for ObjectStore to be the Rust Datafusion equivalent of Arrow C++'s FileSystem or Python's fsspec. They provide a common interface to various object stores (S3, GCS, ADLS, HDFS, etc.) so that various projects implementing readers and writers (such as delta-rs) can simply use those filesystems instead of taking on the burden of writing and maintaining all those abstractions themselves.

A production-ready implementation of data ingestion can be as large an effort as creating another project like Apache Kafka.

This is just the "filesystem" interaction, so just reading and writing bytes to various places with a uniform API. Other "writer" related things like file formats (parquet / json / csv) would be out of scope. Does that make sense?
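
To make the scope concrete, here is a small sketch of a byte-level interface with two interchangeable backends. The names (`ByteStore`, `MemStore`, `LocalDirStore`, `copy_object`) are invented for this example and do not come from datafusion, Arrow C++, or fsspec. The point is only that calling code such as `copy_object` moves opaque bytes and never needs to know which backend, or which file format, is involved.

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::PathBuf;
use std::sync::Mutex;

/// Hypothetical uniform byte-level interface; file formats stay out of scope.
pub trait ByteStore {
    fn put(&self, key: &str, bytes: &[u8]) -> io::Result<()>;
    fn get(&self, key: &str) -> io::Result<Vec<u8>>;
}

/// In-memory backend, standing in for any remote store in this sketch.
#[derive(Default)]
pub struct MemStore {
    objects: Mutex<HashMap<String, Vec<u8>>>,
}

impl ByteStore for MemStore {
    fn put(&self, key: &str, bytes: &[u8]) -> io::Result<()> {
        self.objects
            .lock()
            .unwrap()
            .insert(key.to_string(), bytes.to_vec());
        Ok(())
    }

    fn get(&self, key: &str) -> io::Result<Vec<u8>> {
        let objects = self.objects.lock().unwrap();
        objects
            .get(key)
            .cloned()
            .ok_or_else(|| io::Error::new(io::ErrorKind::NotFound, key.to_string()))
    }
}

/// Local-filesystem backend that maps keys to paths under `root`.
pub struct LocalDirStore {
    pub root: PathBuf,
}

impl ByteStore for LocalDirStore {
    fn put(&self, key: &str, bytes: &[u8]) -> io::Result<()> {
        let path = self.root.join(key);
        // Create intermediate directories so nested keys work.
        if let Some(parent) = path.parent() {
            fs::create_dir_all(parent)?;
        }
        fs::write(path, bytes)
    }

    fn get(&self, key: &str) -> io::Result<Vec<u8>> {
        fs::read(self.root.join(key))
    }
}

/// Backend-agnostic caller: copies raw bytes without interpreting them.
pub fn copy_object(src: &dyn ByteStore, dst: &dyn ByteStore, key: &str) -> io::Result<()> {
    dst.put(key, &src.get(key)?)
}
```

A reader or writer for a specific format (parquet, json, csv) would sit on top of a trait like this and would not need to change when a new backend is added.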

@Cheappie
Contributor

First of all, I am just an outsider evaluating the datafusion query engine, so I might lack some context and my point might not be valid for this case.

Yes, that makes sense. From what I see, a writer API adds a point of failure upstream. For example, how is it going to deal with data loss if the process crashes, or with missing write permissions on the S3 bucket? An ObjectStore that only performs reads cannot corrupt the data source, and from my perspective that is great. I would suggest pushing this cross-filesystem implementation into the Rust Arrow repository, as C++ did; then the implementation would be even more reusable.
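
One common way to limit the crash scenario on a local filesystem is to write to a temporary file and rename it into place, so readers see either the old object or the complete new one. The sketch below assumes a local backend only; the function name `write_atomically` and the `.tmp` suffix are made up for illustration. Object stores without a rename operation need a different mechanism (on S3, for example, a multipart upload only becomes visible once it is completed), and permission errors would simply surface as an `Err` from the write call.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Write `bytes` to `dest` via a sibling temp file plus rename.
/// Hypothetical helper for a local-filesystem backend only.
pub fn write_atomically(dest: &Path, bytes: &[u8]) -> io::Result<()> {
    // Keep the temp file in the same directory so the rename stays
    // on a single filesystem.
    let mut tmp_name = dest
        .file_name()
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "destination has no file name"))?
        .to_os_string();
    tmp_name.push(".tmp");
    let tmp = dest.with_file_name(tmp_name);

    let mut file = File::create(&tmp)?;
    file.write_all(bytes)?;
    // Flush file contents to disk before publishing the new name.
    file.sync_all()?;
    drop(file);

    // Replaces any existing file at `dest`; a crash before this point
    // leaves `dest` untouched (a stale temp file may remain).
    fs::rename(&tmp, dest)
}
```

This sketch does not handle concurrent writers to the same destination or cleanup of leftover temp files after a crash.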

@andygrove
Member

This issue is a little out of date. We recently switched to a new object store crate and it appears to support writes.

https://docs.rs/object_store/0.5.0/object_store/trait.ObjectStore.html

@wjones127
Member Author

Yes! I'm happy to close this, and other issues can be filed for any further integration work.

5 participants