Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Upstream Spice AI changes to datafusion-federation #51

Merged
merged 54 commits into from
Sep 2, 2024

Conversation

phillipleblanc
Copy link
Collaborator

@phillipleblanc phillipleblanc commented Aug 29, 2024

This PR contains all of the changes that the spiceai project made to datafusion-federation to fix bugs and add features to make federation work better for our use-cases.

The main two features that were added:

  1. The name of the table registered in DataFusion doesn't need to match the name of the table in the remote database. This was an important requirement for our product, and we did a lot of work to get this working correctly.
  2. The Arrow schema returned by the provider may not exactly match what DataFusion expects it to be, so we add a schema casting layer to make sure. Where possible this leverages the native Arrow cast kernels, but there are a few special cases that are handled.

phillipleblanc and others added 30 commits May 9, 2024 14:56
* testing

* testing2

* Rewrite all table scans to use federated table name

* Wrap the table scan in a subquery alias
* Rewrite subquery table scans to point to remote table

* Add tracing

* more tracing

* more debug logging

* Handle subquery in binary expressions
* Fix the table scan rewrite to properly rewrite column relations

* Handle more expressions for table scan rewrites
Update

Support for multiple table occurrences in single expression

Support for multiple different tables in single aggregation expression

Document rewrite_column_name_in_expr logic

Update

Fix spelling
…uery when executing (#10)

* keep original qualified_field when wrap_projections

* only mutate the table reference when sending plan to execute

* formatting

* rename method
* do not over eager rewrite column when col relation is there

* add more tests

* Update sources/sql/src/lib.rs

Co-authored-by: Phillip LeBlanc <[email protected]>

---------

Co-authored-by: Phillip LeBlanc <[email protected]>
sgrebnov and others added 22 commits August 25, 2024 11:07
* Add List types parsing support to schema cast

* Update to use arrow-json for lists parsing
* Add List types parsing support to schema cast

* Update to use arrow-json for lists parsing

* Add interval cast given original schema

* Fix test

---------

Co-authored-by: sgrebnov <[email protected]>
Co-authored-by: Phillip LeBlanc <[email protected]>
* Add Structs parsing support to schema cast

* Optimize memory space
Co-authored-by: Suriya Kandaswamy <[email protected]>
Co-authored-by: Suriya Kandaswamy <[email protected]>
@phillipleblanc phillipleblanc changed the title Upstream Spice AI changes to datafusion-federation feat: Upstream Spice AI changes to datafusion-federation Aug 29, 2024
edition = "2021"
license = "MIT"
license = "Apache-2.0"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I felt this would be better since DataFusion is an Apache project?

pub use table_provider::{FederatedTableProviderAdaptor, FederatedTableSource};

// TODO clean up this
// TODO move schema_cast.rs to schema_cast directory
pub mod schema_cast;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This schema_cast module is useful when the schema registered in DataFusion and the schema returned by the underlying provider don't match exactly. The schema_cast provides a mechanism to keep the schema aligned with what DataFusion expects.

Copy link
Collaborator

@hozan23 hozan23 Aug 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put TODO there as reminder to clean up the module by moving schema_cast.rs to schema_cast/mod.rs

@@ -100,6 +113,481 @@ impl OptimizerRule for SQLFederationOptimizerRule {
false
}
}

/// Rewrite table scans to use the original federated table name.
fn rewrite_table_scans(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is needed to support the table name in the remote source being different from the table name registered in DataFusion. This required a lot of trial and error to get right, and we have a good integration test coverage of this in spiceai.

datafusion-federation/src/sql/mod.rs Outdated Show resolved Hide resolved
@phillipleblanc phillipleblanc marked this pull request as ready for review August 29, 2024 12:39
@backkem
Copy link
Collaborator

backkem commented Sep 2, 2024

The rewriting does seem like good generic functionality so I'm fine to accept it in.

@hozan23 hozan23 merged commit af09abe into datafusion-contrib:main Sep 2, 2024
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants