Discuss: Rework `FederationOptimizer` rule back to an `AnalyzerRule` #94

phillipleblanc · 2024-12-24T02:33:26Z

I will soon be working to merge our codebase (github.com/spiceai/spiceai) to depend directly on datafusion-contrib/datafusion-federation instead of our existing fork at spiceai/datafusion-federation. We've made a lot of improvements to our fork over time (including recently the ability to handle proper subquery OuterReferenceColumn federation)

We've also added a lot of testing for federation across many different databases, including (almost) full support for TPCH, TPCDS and Clickbench queries in DataFusion federating to different backend systems.

What we've found is that we need very fine control over when the federation rule is run relative to other analyzer/optimizer rules, and we actually run several optimizer rules directly in the federation rule (to handle projection/limit/filter push down, before we run the federation logic).

The PR #37 changed the Federation rule from an AnalyzerRule to an OptimizerRule, but when we tried to mimic that change - lots of queries that were previous working, stopped working. In particular, several of the default DataFusion rules convert the LogicalPlan into a format that breaks the unparser, or change the semantics enough that it isn't general purpose anymore.

One example is the ResolveGroupingFunction analyzer rule - this replaces several expressions for grouping sets with a DataFusion-only __grouping_id column reference - and this breaks unparsing for queries that use grouping sets. The optimize_projections rule also modifes the LogicalPlan in ways that break unparsing (see apache/datafusion#13267). Even the scalar_subquery_to_join optimizer rule that attempts to rewrite subqueries into joins we've found broke a lot of our queries.

What we've needed to do in our codebase to get datafusion-federation working well is to customize exactly which rules run before/after, as well as run a subset of optimizer rules within the federation rule itself. Concretely, that means we've needed to keep it as an Analyzer rule, not an Optimizer rule - plus have the ability to register other optimizer rules to run inside of the federation rule.

We'd really like to merge back with datafusion-federation and also upstream some of our testing framework here as well - but that would require converting the federation rule back to an Analyzer rule as I described above. Does that sound reasonable? Another alternative would be to support both flavors of federation - an Analyzer or Optimizer, and allow consuming projects to decide what works for them - and I think we could keep the core of the federation algorithm shared.

cc @backkem @hozan23

The text was updated successfully, but these errors were encountered:

backkem · 2024-12-24T10:00:41Z

The motivation is sound. I see no issue migrating back to an analyzer rule.

hozan23 · 2024-12-24T10:13:38Z

Thanks for fixing the subquery issue, I was investigating this issue recently.

Concretely, that means we've needed to keep it as an Analyzer rule, not an Optimizer rule - plus have the ability to register other optimizer rules to run inside of the federation rule.

That makes sense to me.

I am happy to help you with merging and upstreaming these changes.

phillipleblanc self-assigned this Dec 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Discuss: Rework `FederationOptimizer` rule back to an `AnalyzerRule` #94

Discuss: Rework `FederationOptimizer` rule back to an `AnalyzerRule` #94

phillipleblanc commented Dec 24, 2024

backkem commented Dec 24, 2024

hozan23 commented Dec 24, 2024 •

edited

Loading

Discuss: Rework FederationOptimizer rule back to an AnalyzerRule #94

Discuss: Rework FederationOptimizer rule back to an AnalyzerRule #94

Comments

phillipleblanc commented Dec 24, 2024

backkem commented Dec 24, 2024

hozan23 commented Dec 24, 2024 • edited Loading

Discuss: Rework `FederationOptimizer` rule back to an `AnalyzerRule` #94

Discuss: Rework `FederationOptimizer` rule back to an `AnalyzerRule` #94

hozan23 commented Dec 24, 2024 •

edited

Loading