You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
We've also added a lot of testing for federation across many different databases, including (almost) full support for TPCH, TPCDS and Clickbench queries in DataFusion federating to different backend systems.
What we've found is that we need very fine control over when the federation rule is run relative to other analyzer/optimizer rules, and we actually run several optimizer rules directly in the federation rule (to handle projection/limit/filter push down, before we run the federation logic).
The PR #37 changed the Federation rule from an AnalyzerRule to an OptimizerRule, but when we tried to mimic that change - lots of queries that were previous working, stopped working. In particular, several of the default DataFusion rules convert the LogicalPlan into a format that breaks the unparser, or change the semantics enough that it isn't general purpose anymore.
One example is the ResolveGroupingFunction analyzer rule - this replaces several expressions for grouping sets with a DataFusion-only __grouping_id column reference - and this breaks unparsing for queries that use grouping sets. The optimize_projections rule also modifes the LogicalPlan in ways that break unparsing (see apache/datafusion#13267). Even the scalar_subquery_to_join optimizer rule that attempts to rewrite subqueries into joins we've found broke a lot of our queries.
What we've needed to do in our codebase to get datafusion-federation working well is to customize exactly which rules run before/after, as well as run a subset of optimizer rules within the federation rule itself. Concretely, that means we've needed to keep it as an Analyzer rule, not an Optimizer rule - plus have the ability to register other optimizer rules to run inside of the federation rule.
We'd really like to merge back with datafusion-federation and also upstream some of our testing framework here as well - but that would require converting the federation rule back to an Analyzer rule as I described above. Does that sound reasonable? Another alternative would be to support both flavors of federation - an Analyzer or Optimizer, and allow consuming projects to decide what works for them - and I think we could keep the core of the federation algorithm shared.
Thanks for fixing the subquery issue, I was investigating this issue recently.
Concretely, that means we've needed to keep it as an Analyzer rule, not an Optimizer rule - plus have the ability to register other optimizer rules to run inside of the federation rule.
That makes sense to me.
I am happy to help you with merging and upstreaming these changes.
I will soon be working to merge our codebase (github.com/spiceai/spiceai) to depend directly on
datafusion-contrib/datafusion-federation
instead of our existing fork atspiceai/datafusion-federation
. We've made a lot of improvements to our fork over time (including recently the ability to handle proper subquery OuterReferenceColumn federation)We've also added a lot of testing for federation across many different databases, including (almost) full support for TPCH, TPCDS and Clickbench queries in DataFusion federating to different backend systems.
What we've found is that we need very fine control over when the federation rule is run relative to other analyzer/optimizer rules, and we actually run several optimizer rules directly in the federation rule (to handle projection/limit/filter push down, before we run the federation logic).
The PR #37 changed the Federation rule from an AnalyzerRule to an OptimizerRule, but when we tried to mimic that change - lots of queries that were previous working, stopped working. In particular, several of the default DataFusion rules convert the LogicalPlan into a format that breaks the unparser, or change the semantics enough that it isn't general purpose anymore.
One example is the
ResolveGroupingFunction
analyzer rule - this replaces several expressions for grouping sets with a DataFusion-only__grouping_id
column reference - and this breaks unparsing for queries that use grouping sets. Theoptimize_projections
rule also modifes the LogicalPlan in ways that break unparsing (see apache/datafusion#13267). Even thescalar_subquery_to_join
optimizer rule that attempts to rewrite subqueries into joins we've found broke a lot of our queries.What we've needed to do in our codebase to get
datafusion-federation
working well is to customize exactly which rules run before/after, as well as run a subset of optimizer rules within the federation rule itself. Concretely, that means we've needed to keep it as an Analyzer rule, not an Optimizer rule - plus have the ability to register other optimizer rules to run inside of the federation rule.We'd really like to merge back with
datafusion-federation
and also upstream some of our testing framework here as well - but that would require converting the federation rule back to an Analyzer rule as I described above. Does that sound reasonable? Another alternative would be to support both flavors of federation - an Analyzer or Optimizer, and allow consuming projects to decide what works for them - and I think we could keep the core of the federation algorithm shared.cc @backkem @hozan23
The text was updated successfully, but these errors were encountered: