Extend & generalize constant folding / evaluation in logical optimizer #237

Dandandan · 2021-05-02T09:53:08Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The (logical) optimizer contains some support for folding (boolean) constants. This can help, especially with other optimization passes, to optimize queries. For example, LIMIT (0 + 0) could be optimized first to LIMIT 0, to enable eliminating the whole plan.

We should try to extend this support to most datatypes & expressions.

Describe the solution you'd like
Exprs can already be evaluated against a RecordBatch, and there is code to evaluate scalar values without going through Arrow. To make sure that constant evaluation is implemented correctly and behaves the same as the evaluation code, we should reuse that code.
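To illustrate the idea of folding by reusing a single evaluator, here is a minimal sketch. It uses a toy `Expr` enum rather than DataFusion's real types; `evaluate_constant` and `fold_constants` are hypothetical names introduced only for this example.

```rust
// Toy expression tree. This is NOT DataFusion's `Expr`; it only
// models integer literals, column references and addition.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Literal(i64),
    Column(String),
    Add(Box<Expr>, Box<Expr>),
}

/// The single evaluator. Returns `None` when the expression depends on
/// an input column, or when the addition would overflow.
fn evaluate_constant(expr: &Expr) -> Option<i64> {
    match expr {
        Expr::Literal(v) => Some(*v),
        Expr::Column(_) => None,
        Expr::Add(l, r) => evaluate_constant(l)?.checked_add(evaluate_constant(r)?),
    }
}

/// Constant folding that delegates to the evaluator instead of
/// re-implementing the arithmetic. Sub-trees that cannot be evaluated
/// are kept, and their children are folded recursively.
fn fold_constants(expr: Expr) -> Expr {
    if let Some(v) = evaluate_constant(&expr) {
        return Expr::Literal(v);
    }
    match expr {
        Expr::Add(l, r) => Expr::Add(
            Box::new(fold_constants(*l)),
            Box::new(fold_constants(*r)),
        ),
        other => other,
    }
}
```

With this sketch, `0 + 0` folds to the literal `0`, while `a + (1 + 2)` becomes `a + 3`. Because the fold pass never performs arithmetic itself, it cannot drift from the evaluator's semantics.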

Describe alternatives you've considered
Manually implement the constant folding support. Downside here is that we end up with two implementations, which has a higher maintenance burden.

Additional context
Not in scope: adding it to the physical optimizer as well. It could help there too, especially if we have support for partitions.

alamb · 2021-05-03T21:05:42Z

This would be a neat feature.

Dandandan · 2021-05-06T20:05:10Z

@alamb just picking your brain here - do you think this should be part of the logical optimizations or physical optimizations?
I think the suggested route (re-use evaluation code) is only feasible for the physical optimization, not the logical optimization rules.

A way that could work within the current setup for PhysicalExpr:

Evaluation is implemented for PhysicalExprs not Exprs. An empty RecordBatch could be used to call into the code and extract the scalar values (if any).
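As a rough sketch of the "evaluate against an empty batch" idea, the following uses toy stand-ins for `RecordBatch`, `PhysicalExpr` and a scalar-or-array result. None of these types are the real Arrow or DataFusion ones, and `try_fold` is a hypothetical helper name.

```rust
// Toy stand-ins, NOT the real Arrow/DataFusion types.
#[derive(Debug, Clone, PartialEq)]
enum ColumnarValue {
    Scalar(i64),
    Array(Vec<i64>),
}

struct Batch {
    columns: Vec<Vec<i64>>,
}

#[derive(Debug, Clone, PartialEq)]
enum PhysicalExpr {
    Literal(i64),
    Column(usize),
    Add(Box<PhysicalExpr>, Box<PhysicalExpr>),
}

/// The regular evaluation path: always takes a batch as input.
/// Wrapping arithmetic is used only to keep the sketch short.
fn evaluate(expr: &PhysicalExpr, batch: &Batch) -> Result<ColumnarValue, String> {
    use ColumnarValue::{Array, Scalar};
    match expr {
        PhysicalExpr::Literal(v) => Ok(Scalar(*v)),
        PhysicalExpr::Column(i) => batch
            .columns
            .get(*i)
            .cloned()
            .map(Array)
            .ok_or_else(|| format!("no column at index {}", i)),
        PhysicalExpr::Add(l, r) => match (evaluate(l, batch)?, evaluate(r, batch)?) {
            (Scalar(a), Scalar(b)) => Ok(Scalar(a.wrapping_add(b))),
            // Broadcast the scalar over the array.
            (Array(a), Scalar(b)) | (Scalar(b), Array(a)) => {
                Ok(Array(a.iter().map(|x| x.wrapping_add(b)).collect()))
            }
            (Array(a), Array(b)) => Ok(Array(
                a.iter().zip(b.iter()).map(|(x, y)| x.wrapping_add(*y)).collect(),
            )),
        },
    }
}

/// Reuse `evaluate` for folding: run it on a batch with no columns and
/// keep the result only if it is a scalar. Any column reference fails
/// the lookup, so the expression is reported as not foldable.
fn try_fold(expr: &PhysicalExpr) -> Option<i64> {
    let empty = Batch { columns: Vec::new() };
    match evaluate(expr, &empty) {
        Ok(ColumnarValue::Scalar(v)) => Some(v),
        _ => None,
    }
}
```

In this toy model, `try_fold` returns `Some(0)` for `0 + 0` and `None` for any expression that touches a column, while the same `evaluate` function still handles real batches.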

This makes it a bit less useful (though still useful), as some other logical optimizations might benefit from constant folding but would not see the folded expressions.

I am wondering here in general whether we can/should unify LogicalPlan/PhysicalPlan and Expr/PhysicalExpr a bit more, in order to not have to write two versions of the same thing or be limited in the optimizations / optimization order.

jorgecarleitao · 2021-05-06T20:12:20Z

fwiw, when a logical optimization is applied, the expressions are re-written and the "column name" is consequently re-written. Thus, what was named LIMIT (0 + 0) becomes LIMIT 0.

To apply it on the logical level, we may need to wrap the expression by an .alias for it to preserve the column name.
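A minimal sketch of the alias idea, using a toy `Expr` enum (not DataFusion's). The `Int64(...)` naming format and the functions `display_name`, `fold` and `fold_preserving_name` are assumptions made for illustration.

```rust
// Toy expression tree, NOT DataFusion's `Expr`.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Literal(i64),
    Add(Box<Expr>, Box<Expr>),
    Alias(Box<Expr>, String),
}

/// Output column name derived from the expression's shape
/// (the naming format here is an assumption).
fn display_name(expr: &Expr) -> String {
    match expr {
        Expr::Literal(v) => format!("Int64({})", v),
        Expr::Add(l, r) => format!("{} + {}", display_name(l), display_name(r)),
        Expr::Alias(_, name) => name.clone(),
    }
}

/// Plain constant folding; on its own it changes the derived name.
fn fold(expr: &Expr) -> Expr {
    match expr {
        Expr::Add(l, r) => match (fold(l), fold(r)) {
            // Only fold when the addition does not overflow.
            (Expr::Literal(a), Expr::Literal(b)) if a.checked_add(b).is_some() => {
                Expr::Literal(a + b)
            }
            (l, r) => Expr::Add(Box::new(l), Box::new(r)),
        },
        other => other.clone(),
    }
}

/// Fold, then wrap the result in an alias carrying the original name
/// whenever folding would otherwise rename the output column.
fn fold_preserving_name(expr: &Expr) -> Expr {
    let original = display_name(expr);
    let folded = fold(expr);
    if display_name(&folded) == original {
        folded
    } else {
        Expr::Alias(Box::new(folded), original)
    }
}
```

Here `0 + 0` folds to the literal `0`, but the result is wrapped in an alias named `Int64(0) + Int64(0)`, so the output column keeps the name it had before optimization. Expressions whose name is unchanged by folding are returned without an alias.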

I agree that the sooner in the optimization these are applied, the higher the likelihood of synergies between optimizers.

Dandandan · 2021-05-06T20:15:26Z

@jorgecarleitao that's a good one - I also saw something along the same lines recently when looking at the problem in #268.

Dandandan · 2021-05-06T20:25:51Z

It is a problem already in the current constant folding! I am opening an issue for this.

> SELECT TRUE = TRUE;
+---------------+
| Boolean(true) |
+---------------+
| true          |
+---------------+
1 rows in set. Query took 0 seconds.

alamb · 2021-05-07T18:58:38Z

@Dandandan

> @alamb just picking your brain here - do you think this should be part of the logical optimizations or physical optimizations?
> I think the suggested route (re-use evaluation code) is only feasible for the physical optimization, not the logical optimization rules.

I would imagine this to be done on Exprs, not PhysicalExprs to allow the rewritten expressions to be used as much as possible by other optimization passes (e.g. filter and projection pushdown, which is done at the LogicalPlan level)

> I am wondering here in general, whether we can/should unify LogicalPlan/PhysicalPlan Expr/PhysicalExpr a bit more in order to not have to write two versions of the same thing or being limited in the optimizations / optimization order.

I think the LogicalPlan / PhysicalPlan distinction makes sense (b/c logically a Join is just a Join -- but physically maybe we would be using a CROSS JOIN w/ filter, or a Hash Inner Join, or a Merge Join, etc).

I am not as sure about the distinction between Expr and PhysicalExpr -- I haven't looked carefully at the code to know what additional information a PhysicalExpr needs that an Expr doesn't have -- and you can make a PhysicalExpr from an Expr and a Schema (code link).

If we could directly evaluate Exprs without having to apply a transformation to them that would be pretty cool (and clean up a lot of duplication I think)

alamb · 2021-05-19T17:46:35Z

FYI while I was reviewing the code in https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/parquet.rs in the context of #363 I noticed there is already a way to do "partial evaluation" for expressions -- maybe we could fake the same to evaluate Exprs that have no inputs -- pass in a 1 row null array as input or something.

alamb · 2021-10-11T11:12:15Z

Possibly related to #1070

alamb · 2022-09-12T11:31:59Z

I think we have implemented most of the suggestions in this issue -- I am not sure if it is tracking anything actionable anymore

Dandandan · 2022-09-14T20:32:56Z

> I think we have implemented most of the suggestions in this issue -- I am not sure if it is tracking anything actionable anymore

Yes I agree, this is done 🚀. Closing this issue

Dandandan added the enhancement New feature or request label May 2, 2021

alamb added the datafusion Changes in the datafusion crate label May 3, 2021

Dandandan mentioned this issue May 6, 2021

Improve unnamed expressions #279

Open

Dandandan added the performance Make DataFusion faster label May 24, 2021

Dandandan closed this as completed Sep 14, 2022
