Datafusion integration assumes table's data files are local #43
Comments
Yeah, unfortunately, datafusion uses the arrow parquet readers, which only support local files at the moment: https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/parquet.rs#L181. I think this is best handled in the rust parquet reader first, with minor adjustments to datafusion's execution plan after that. @nevi-me has plans to add S3 support to the parquet reader. If you are interested in extending the reader to support S3 or other cloud storages, I would recommend collaborating with him :)
Makes sense to me!
Sounds good, I'll keep an eye on it and try to contribute an Azure reader when the time comes.
What could work in the interim is to use DataFusion's in-memory datasource (https://docs.rs/datafusion/2.0.0/datafusion/datasource/memory/index.html). Once we have async support in the Parquet reader, we can switch to the relevant methods.
@nevi-me is there a bug anywhere to track S3 support? I took a brief look in the Arrow and Datafusion repos and didn't find anything. If you're open to it it's something that we could potentially look in to contributing. |
@meastham feel free to start a discussion for s3 support in the upstream datafusion github repo or in the arrow dev mailing list. |
Given object store support in datafusion, can a blob path integration be implemented, assuming we have an appropriate blob store implementation of the object_store interface? I understand that, given this, we can pass file names prefixed with the appropriate storage handler name from delta-rs. My question is: is the datafusion execution plan integration with this data source complete, or is it still in progress?
@gopik yes, we are waiting on upstream object store support for S3. The datafusion execution plan integration is complete except for partition column support, which should be fairly straightforward to add.
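The "file names prefixed with the appropriate storage handler name" part of the exchange above is just URI construction: join the table root (which carries the scheme, e.g. `s3://`) with the relative paths recorded in the Delta log. A standard-library sketch, with hypothetical names that are not delta-rs API:

```rust
// Turn a relative file path from a Delta log into a scheme-prefixed URI
// that an object-store-aware reader could dispatch on. The helper name and
// table root below are illustrative, not actual delta-rs identifiers.
fn to_absolute_uri(table_root: &str, relative_path: &str) -> String {
    format!("{}/{}", table_root.trim_end_matches('/'), relative_path)
}

fn main() {
    let root = "s3://my-bucket/tables/events";
    let files = ["part-00000-abc.snappy.parquet", "part-00001-def.snappy.parquet"];
    for f in &files {
        println!("{}", to_absolute_uri(root, f));
        // prints e.g. s3://my-bucket/tables/events/part-00000-abc.snappy.parquet
    }
}
```

The scheme prefix is what lets a multi-backend reader pick the right storage implementation for each file.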
@houqp When you say upstream object support for s3, will that be part of datafusion project or it'll be part of an integration that is embedding datafusion? |
@gopik it will be part of datafusion, see apache/datafusion#907 |
Resolved with the adoption of https://github.com/delta-io/delta-rs/blob/main/rust/tests/integration_datafusion.rs
The Datafusion integration passes a list of file paths representing a table's actual data to Datafusion's ParquetExec, but if the Delta table's StorageBackend is anything other than the FileStorageBackend, this fails because the files aren't local.
I'm not sure where this should be handled, though; it feels like it should be part of Datafusion or an extension crate?
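The failure mode described in the report can be reproduced with nothing but the standard library: a remote URI handed to a local-file reader is treated as a relative filesystem path and the open fails. The bucket and file names below are made up for illustration.

```rust
use std::fs::File;

fn main() {
    // A FileStorageBackend path would open fine, but an s3:// URI passed to
    // a reader built on std::fs is interpreted as a relative path like
    // "./s3:/my-bucket/..." and cannot be opened.
    let result = File::open("s3://my-bucket/tables/events/part-00000.parquet");
    assert!(result.is_err());
    println!("opening an s3:// URI as a local path fails: {}", result.is_err());
}
```

This is why the fix has to live either in the reader itself (teach it about object stores) or in a layer above it that fetches remote bytes before handing them to the local reader.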