Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discuss: Rework FederationOptimizer rule back to an AnalyzerRule #94

Open
phillipleblanc opened this issue Dec 24, 2024 · 2 comments
Open
Assignees

Comments

@phillipleblanc
Copy link
Collaborator

I will soon be working to merge our codebase (github.com/spiceai/spiceai) to depend directly on datafusion-contrib/datafusion-federation instead of our existing fork at spiceai/datafusion-federation. We've made a lot of improvements to our fork over time (including recently the ability to handle proper subquery OuterReferenceColumn federation)

We've also added a lot of testing for federation across many different databases, including (almost) full support for TPCH, TPCDS and Clickbench queries in DataFusion federating to different backend systems.

What we've found is that we need very fine control over when the federation rule is run relative to other analyzer/optimizer rules, and we actually run several optimizer rules directly in the federation rule (to handle projection/limit/filter push down, before we run the federation logic).

The PR #37 changed the Federation rule from an AnalyzerRule to an OptimizerRule, but when we tried to mimic that change - lots of queries that were previous working, stopped working. In particular, several of the default DataFusion rules convert the LogicalPlan into a format that breaks the unparser, or change the semantics enough that it isn't general purpose anymore.

One example is the ResolveGroupingFunction analyzer rule - this replaces several expressions for grouping sets with a DataFusion-only __grouping_id column reference - and this breaks unparsing for queries that use grouping sets. The optimize_projections rule also modifes the LogicalPlan in ways that break unparsing (see apache/datafusion#13267). Even the scalar_subquery_to_join optimizer rule that attempts to rewrite subqueries into joins we've found broke a lot of our queries.

What we've needed to do in our codebase to get datafusion-federation working well is to customize exactly which rules run before/after, as well as run a subset of optimizer rules within the federation rule itself. Concretely, that means we've needed to keep it as an Analyzer rule, not an Optimizer rule - plus have the ability to register other optimizer rules to run inside of the federation rule.

We'd really like to merge back with datafusion-federation and also upstream some of our testing framework here as well - but that would require converting the federation rule back to an Analyzer rule as I described above. Does that sound reasonable? Another alternative would be to support both flavors of federation - an Analyzer or Optimizer, and allow consuming projects to decide what works for them - and I think we could keep the core of the federation algorithm shared.

cc @backkem @hozan23

@phillipleblanc phillipleblanc self-assigned this Dec 24, 2024
@backkem
Copy link
Collaborator

backkem commented Dec 24, 2024

The motivation is sound. I see no issue migrating back to an analyzer rule.

@hozan23
Copy link
Collaborator

hozan23 commented Dec 24, 2024

Thanks for fixing the subquery issue, I was investigating this issue recently.

Concretely, that means we've needed to keep it as an Analyzer rule, not an Optimizer rule - plus have the ability to register other optimizer rules to run inside of the federation rule.

That makes sense to me.

I am happy to help you with merging and upstreaming these changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants