Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend & generalize constant folding / evaluation in logical optimizer #237

Closed
Dandandan opened this issue May 2, 2021 · 10 comments
Closed
Labels
datafusion Changes in the datafusion crate enhancement New feature or request performance Make DataFusion faster

Comments

@Dandandan
Copy link
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The (logical) optimizer contains some support for folding (boolean) constants. This can help, especially with other optimization passes, to optimize queries. For example, LIMIT (0 + 0) could be optimized first to LIMIT 0, to enable eliminating the whole plan.

We should try to extend this support to most datatypes & expressions.

Describe the solution you'd like
Exprs can already be evaluated against a RecordBatch, and there is code to evaluate scalar values without going through Arrow. To make sure that the constant evaluation is implemented correctly & the same as the evaluation code, we should be able to reuse the code from there.

Describe alternatives you've considered
Manually implement the constant folding support. Downside here is that we end up with two implementations, which has a higher maintenance burden.

Additional context
Not in scope: add it to physical optimizer too. Here it could help too, especially if we have support for partitions.

@Dandandan Dandandan added the enhancement New feature or request label May 2, 2021
@alamb
Copy link
Contributor

alamb commented May 3, 2021

This would be a neat feature.

@alamb alamb added the datafusion Changes in the datafusion crate label May 3, 2021
@Dandandan
Copy link
Contributor Author

@alamb just picking your brain here - do you think this should be part of the logical optimizations or physical optimizations?
I think the suggested route (re-use evaluation code) is only feasible for the physical optimization, not the logical optimization rules.

A way that could work within the current setup for PhysicalExpr:

  • Evaluation is implemented for PhysicalExprs not Exprs. An empty RecordBatch could be used to call into the code and extract the scalar values (if any).

This makes it a bit less useful (still useful nonetheless), as some other optimizations might benefit from constant folding

I am wondering here in general, whether we can/should unify LogicalPlan/PhysicalPlan Expr/PhysicalExpr a bit more in order to not have to write two versions of the same thing or being limited in the optimizations / optimization order.

@jorgecarleitao
Copy link
Member

fwiw, when a logical optimization is applied, the expressions are re-written and the "column name" is consequently re-written. Thus, what was named LIMIT (0 + 0) becomes LIMIT 0.

To apply it on the logical level, we may need to wrap the expression by an .alias for it to preserve the column name.

I agree that the sooner in the optimization these are applied, the higher the likelihood of synergies between optimizers.

@Dandandan
Copy link
Contributor Author

@jorgecarleitao that's a good one - I did also see something in the same order recently when looking at this #268 problem.

@Dandandan
Copy link
Contributor Author

It is a problem already in the current constant folding! I am opening an issue for this.

> SELECT TRUE = TRUE;
+---------------+
| Boolean(true) |
+---------------+
| true          |
+---------------+
1 rows in set. Query took 0 seconds.

@alamb
Copy link
Contributor

alamb commented May 7, 2021

@Dandandan

@alamb just picking your brain here - do you think this should be part of the logical optimizations or physical optimizations?
I think the suggested route (re-use evaluation code) is only feasible for the physical optimization, not the logical optimization rules.

I would imagine this to be done on Exprs, not PhysicalExprs to allow the rewritten expressions to be used as much as possible by other optimization passes (e.g. filter and projection pushdown, which is done at the LogicalPlan level)

I am wondering here in general, whether we can/should unify LogicalPlan/PhysicalPlan Expr/PhysicalExpr a bit more in order to not have to write two versions of the same thing or being limited in the optimizations / optimization order.

I think the LogicalPlan / PhysicalPlan distinction makes sense (b/c logically a Join is just a Join -- but physically maybe we would be using a CROSS JOIN w/ filter, or an Hash Inner Join, or a Merge Join, etc)

I am not as sure about the distinction between Expr and PhysicalExpr -- I haven't looked carefully at the code to know what additional information a PhysicalExpr needs that an Expr doesn't have -- and you can make a PhysicalExpr from an Expr and a Schema code link

If we could directly evaluate Exprs without having to apply a transformation to them that would be pretty cool (and clean up a lot of duplication I think)

@alamb
Copy link
Contributor

alamb commented May 19, 2021

FYI while I was reviewing the code in https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/parquet.rs in the context of #363 I noticed there is already a way to do "partial evaluation" for expressions -- maybe we could fake the same to evaluate Exprs that have no inputs -- pass in a 1 row null array as input or something.

@Dandandan Dandandan added the performance Make DataFusion faster label May 24, 2021
@alamb
Copy link
Contributor

alamb commented Oct 11, 2021

Possibly related to #1070

@alamb
Copy link
Contributor

alamb commented Sep 12, 2022

I think we have implemented most of the suggestions in this issue -- I am not sure if it is tracking anything actionable anymore

@Dandandan
Copy link
Contributor Author

I think we have implemented most of the suggestions in this issue -- I am not sure if it is tracking anything actionable anymore

Yes I agree, this is done 🚀. Closing this issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate enhancement New feature or request performance Make DataFusion faster
Projects
None yet
Development

No branches or pull requests

3 participants