Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add datafusion storage catalog #1381

Merged
merged 10 commits into from
May 28, 2023
Merged

Conversation

roeap
Copy link
Collaborator

@roeap roeap commented May 20, 2023

Description

Adds a datafusion SchemaProvider implementation based on discovering Delta tables in a storage location via _delta_log folders.

Related Issue(s)

Documentation

@github-actions github-actions bot added binding/rust Issues for the Rust crate rust labels May 20, 2023
@roeap roeap force-pushed the storage-catalog branch from 439d77a to 75019f5 Compare May 20, 2023 10:33
rust/Cargo.toml Outdated Show resolved Hide resolved
/// Underlying object store
store: Arc<dyn ObjectStore>,
/// A map of table names to a fully quilfied storage location
tables: DashMap<String, String>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an artifact of implementing this SchemaProvider that requires the use of DashMap? Since this adds a new dependency, I'm curious what forces this requirement since it's not clear to me

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also go the "classic" route using Arc<Mutex<...>>, however since datafusion internally switched in most places, we already have that as a transient dependency if we use datafusion.

It does however simplify all code where we use this.

@roeap roeap requested a review from rtyler May 23, 2023 18:06
rust/src/data_catalog/storage/mod.rs Outdated Show resolved Hide resolved
rust/src/data_catalog/storage/mod.rs Outdated Show resolved Hide resolved
parent = p;
}
}
for table in tables.into_iter() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bit of logic is a bit obtuse to me, especially the series of chained calls to create file_name. I think it's a good candidate for refactoring into a private function with some unit tests, but at your discretion there since I know sometimes these objects are a PITA to scaffold for tests

@roeap roeap requested a review from rtyler May 27, 2023 13:25
@roeap roeap merged commit 106ad12 into delta-io:main May 28, 2023
@roeap roeap deleted the storage-catalog branch May 28, 2023 18:29
roeap added a commit to roeap/delta-rs that referenced this pull request Jun 2, 2023
Adds a datafusion `SchemaProvider` implementation based on discovering
Delta tables in a storage location via `_delta_log` folders.

<!---
For example:

- closes #106
--->

<!---
Share links to useful documentation
--->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/rust Issues for the Rust crate rust
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants