Avoid changing expression names during constant folding #1319

viirya · 2021-11-16T22:58:04Z

Which issue does this PR close?

Closes #1316.

Rationale for this change

This patch fixes a bug that occurs when users construct an aggregate function whose arguments contain constants that can be folded.

In the projection pushdown rule, we check whether an expression's output (e.g., an aggregate function) is required by the plan above it (e.g., a projection) by comparing expression names. But after the constant folding rule runs, the name of an aggregate function is likely to have changed (e.g., from COUNT(1 + 1) to COUNT(2)). The renamed aggregate function is then removed because we wrongly treat it as an unnecessary expression: the top projection requires #COUNT(1 + 1), not #COUNT(2).
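The failure mode described above can be reproduced in miniature. The following is only a sketch on a toy expression type, not DataFusion's real `Expr` or optimizer code; the names `Expr`, `fold`, `prune`, and `count_one_plus_one` are hypothetical and exist only for illustration. It shows how name-based pruning drops an aggregate once folding has changed its display name.

```rust
use std::collections::HashSet;

// Toy expression type (an assumption for illustration, not DataFusion's `Expr`).
#[derive(Clone, Debug, PartialEq)]
enum Expr {
    Literal(i64),
    Add(Box<Expr>, Box<Expr>),
    Count(Box<Expr>),
}

impl Expr {
    // Display name used to match an output against what the parent plan requires.
    fn name(&self) -> String {
        match self {
            Expr::Literal(v) => format!("Int64({})", v),
            Expr::Add(l, r) => format!("{} + {}", l.name(), r.name()),
            Expr::Count(e) => format!("COUNT({})", e.name()),
        }
    }

    // Naive constant folding: evaluates additions of literals bottom-up.
    fn fold(self) -> Expr {
        match self {
            Expr::Add(l, r) => match ((*l).fold(), (*r).fold()) {
                (Expr::Literal(a), Expr::Literal(b)) => Expr::Literal(a + b),
                (l, r) => Expr::Add(Box::new(l), Box::new(r)),
            },
            Expr::Count(e) => Expr::Count(Box::new((*e).fold())),
            other => other,
        }
    }
}

// Name-based pruning, in the spirit of a projection pushdown rule:
// keep only the outputs whose name the parent plan asked for.
fn prune(outputs: Vec<Expr>, required: &HashSet<String>) -> Vec<Expr> {
    outputs
        .into_iter()
        .filter(|e| required.contains(&e.name()))
        .collect()
}

// Builds COUNT(1 + 1).
fn count_one_plus_one() -> Expr {
    Expr::Count(Box::new(Expr::Add(
        Box::new(Expr::Literal(1)),
        Box::new(Expr::Literal(1)),
    )))
}

fn main() {
    let agg = count_one_plus_one();
    // The parent projection requires the aggregate under its original name.
    let mut required = HashSet::new();
    required.insert(agg.name());

    let folded = agg.clone().fold();
    println!("{} -> {}", agg.name(), folded.name());
    println!("kept before folding: {}", prune(vec![agg], &required).len());
    println!("kept after folding: {}", prune(vec![folded], &required).len());
}
```

In this toy model the folded aggregate no longer matches the required name, so the pruning step discards it, which mirrors the bug reported in #1316.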

What changes are included in this PR?

In the ConstantFolding optimizer rule, keep the original expression name unchanged.

DataFusion CLI v5.1.0-SNAPSHOT

❯ SELECT count(1 + 1);
+----------------------------+
| COUNT(Int64(1) + Int64(1)) |
+----------------------------+
| 1                          |
+----------------------------+
1 row in set. Query took 0.006 seconds.
❯ SELECT 1 + 1;
+---------------------+
| Int64(1) + Int64(1) |
+---------------------+
| 2                   |
+---------------------+
1 row in set. Query took 0.001 seconds.

Are there any user-facing changes?

No

capkurmagati · 2021-11-17T05:46:54Z

Thanks @viirya.
I also thought it was the ConstantFolding optimizer that leads to the bug when I filed the issue. But I couldn't dig further.
I tested the code locally and the test passes, but the cli still crashes for SELECT count(1 + 1).
I wonder if it works in your environment?

viirya · 2021-11-17T08:46:02Z

@capkurmagati Oh, I had not tested the cli before you asked. I just tested it and found another issue. It happens because we optimize the logical plan twice: the first time when parsing sql into a DataFrame, the second time when creating the physical plan. I just pushed a new change.

viirya · 2021-11-17T09:04:48Z

Hmm, there is a test, ctx_sql_should_optimize_plan, that assumes the logical plan returned by sql is already an optimized one. I don't think that makes sense, as the logical plan is not final: e.g. you could apply select and other operations to produce a new DataFrame (i.e., a new logical plan). We only need to optimize it before creating the physical plan (create_physical_plan).

alamb · 2021-11-17T15:00:02Z

Hmm, there is a test, ctx_sql_should_optimize_plan, that assumes the logical plan returned by sql is already an optimized one. I don't think that makes sense, as the logical plan is not final: e.g. you could apply select and other operations to produce a new DataFrame (i.e., a new logical plan). We only need to optimize it before creating the physical plan (create_physical_plan).

I wonder if another possible bug fix might be to change the ConstantEvaluation code to add an alias for aggregates? That is, rewrite aggregates so they explicitly keep the same (original) display name.

For example, rewrite COUNT(1+1) to COUNT(2) as "COUNT(1+1)"
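The aliasing idea suggested here can be sketched on a toy expression type. This is not the code from the PR; `Expr`, `fold`, and `fold_keeping_name` are hypothetical names, and the real DataFusion types differ. The sketch folds an expression and wraps the result in an alias only when folding changed its display name, so no redundant alias is created.

```rust
// Toy expression type with an alias variant (an assumption for illustration).
#[derive(Clone, Debug, PartialEq)]
enum Expr {
    Literal(i64),
    Add(Box<Expr>, Box<Expr>),
    Count(Box<Expr>),
    Alias(Box<Expr>, String),
}

impl Expr {
    // Display name; an alias overrides the name of the wrapped expression.
    fn name(&self) -> String {
        match self {
            Expr::Literal(v) => format!("Int64({})", v),
            Expr::Add(l, r) => format!("{} + {}", l.name(), r.name()),
            Expr::Count(e) => format!("COUNT({})", e.name()),
            Expr::Alias(_, name) => name.clone(),
        }
    }

    // Naive constant folding of literal additions.
    fn fold(self) -> Expr {
        match self {
            Expr::Add(l, r) => match ((*l).fold(), (*r).fold()) {
                (Expr::Literal(a), Expr::Literal(b)) => Expr::Literal(a + b),
                (l, r) => Expr::Add(Box::new(l), Box::new(r)),
            },
            Expr::Count(e) => Expr::Count(Box::new((*e).fold())),
            Expr::Alias(e, n) => Expr::Alias(Box::new((*e).fold()), n),
            other => other,
        }
    }
}

// Fold the expression, then re-attach the original name if folding changed it.
fn fold_keeping_name(e: Expr) -> Expr {
    let original = e.name();
    let folded = e.fold();
    if folded.name() == original {
        folded // name unchanged: no alias needed
    } else {
        Expr::Alias(Box::new(folded), original)
    }
}

fn main() {
    let agg = Expr::Count(Box::new(Expr::Add(
        Box::new(Expr::Literal(1)),
        Box::new(Expr::Literal(1)),
    )));
    let kept = fold_keeping_name(agg);
    println!("{:?} is named {}", kept, kept.name());
}
```

With this shape, a name-based lookup by the parent plan still finds the aggregate, while the work inside the alias is the folded form.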

alamb · 2021-11-17T17:10:17Z

I may be mistaken, but I think @ic4y added aliases in https://github.com/apache/arrow-datafusion/pull/1315/files#diff-1d33be1a7e8231e53102eab8112e30aa89d8f5cb8c21cd25bcfbce3050cdb433R110

viirya · 2021-11-17T17:49:17Z

@alamb Thanks for looking into this. I've thought about adding aliases too. I just tried the current simple approach first to see if it works.

Dandandan · 2021-11-17T17:55:20Z

Great find!

I agree the general fix suggested by @alamb (to add / keep the original alias) is what we should do in this case. In general, I think the order of the optimizers shouldn't have any influence on the outcome of the query - only on the performance.


viirya · 2021-11-18T03:11:08Z

Also tested with the cli:

DataFusion CLI v5.1.0-SNAPSHOT

❯ SELECT count(1 + 1);
+----------------------------+
| COUNT(Int64(1) + Int64(1)) |
+----------------------------+
| 1                          |
+----------------------------+
1 row in set. Query took 0.006 seconds.
❯

Dandandan · 2021-11-18T21:18:55Z

datafusion/src/optimizer/constant_folding.rs

+                        // expression name for them.
+                        let is_plan_for_projection_pushdown = matches!(
+                            plan,
+                            LogicalPlan::Window { .. }


Why only those?

What about SELECT 1+1?

Currently this outputs:

❯ SELECT 1+1 ;
+----------+
| Int64(2) |
+----------+
| 2        |
+----------+
1 row in set. Query took 0.001 seconds.

I would assume it should keep the Int64(1) + Int64(1) here instead.

For example, for Projection, it will create many (seemingly redundant) aliases. Some look okay but some look really weird, e.g. in some failed tests:

Projection: #test.a, #test.d, NOT #test.b AS test.b = Boolean(false) ...

Projection: Int32(0) AS CAST(Utf8(\"0\") AS Int32) ...

We have a lot of tests that would fail due to that.

Have you tried following the model in https://github.com/apache/arrow-datafusion/pull/1315/files#diff-1d33be1a7e8231e53102eab8112e30aa89d8f5cb8c21cd25bcfbce3050cdb433R110 ? I think that calls columnize_expr among perhaps some other differences.

Basically I think the code needs to do something like walk over the field names in the output schema and, if the names of the rewritten exprs aren't the same, add an alias.
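The schema-driven variant described here could look roughly like the following. It is a sketch on a toy expression type, assuming that the output schema's field names and the rewritten expressions line up one-to-one by position; `Expr` and `restore_names` are hypothetical names, not DataFusion APIs.

```rust
// Toy expression type (an assumption for illustration, not DataFusion's `Expr`).
#[derive(Clone, Debug, PartialEq)]
enum Expr {
    Column(String),
    Literal(i64),
    Alias(Box<Expr>, String),
}

impl Expr {
    // Display name; an alias overrides the name of the wrapped expression.
    fn name(&self) -> String {
        match self {
            Expr::Column(c) => c.clone(),
            Expr::Literal(v) => format!("Int64({})", v),
            Expr::Alias(_, n) => n.clone(),
        }
    }
}

// Walk the original field names alongside the rewritten expressions and
// alias every expression whose name no longer matches its field.
fn restore_names(field_names: &[&str], rewritten: Vec<Expr>) -> Vec<Expr> {
    field_names
        .iter()
        .zip(rewritten)
        .map(|(field, expr)| {
            if expr.name() == *field {
                expr
            } else {
                Expr::Alias(Box::new(expr), field.to_string())
            }
        })
        .collect()
}

fn main() {
    // Field names as they were before the rewrite.
    let fields = ["a", "Int64(1) + Int64(1)"];
    // Expressions after a rewrite that folded the second one to a literal.
    let rewritten = vec![Expr::Column("a".to_string()), Expr::Literal(2)];
    for e in restore_names(&fields, rewritten) {
        println!("{:?}", e);
    }
}
```

Driving the aliasing from the schema rather than from each expression's own name keeps the check independent of which rewrite produced the change.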

(I agree with @Dandandan that this should apply to all nodes, not just a few special cased ones)

Basically I think the code needs to do something like walk over the field names in the output schema and, if the names of the rewritten exprs aren't the same, add an alias.

This sounds promising. No, I've not tried columnize_expr. Let me revise this and see if it works.

Thanks for trying @viirya -- I'll see if I can find some time this weekend to mess around with it

Thanks @alamb . I'll keep trying on this too.

For example, for Projection, it will create many (seemingly redundant) aliases. Some look okay but some look really weird, e.g. in some failed tests:

@viirya the example you gave here looks like correct behavior to me. Are you concerned about the large number of test updates, or are there other unwanted side effects of this approach?

Oh, I'm simply unsure whether such changes are okay here, as it looks like most queries will be affected (not in their results, only cosmetically).

If it looks good to you, I will update all the tests.


viirya · 2021-11-19T22:04:41Z

Note that the current approach, which compares expression names, works. The only issue is that if we apply it to all nodes, many tests need to be updated because many aliases will be added there. E.g.,

optimizer::constant_folding::tests::optimize_plan_support_projection

  left: `"Projection: #test.a, #test.d, NOT #test.b AS test.b = Boolean(false)\n  TableScan: test projection=None"`,                                                         
 right: `"Projection: #test.a, #test.d, NOT #test.b\n  TableScan: test projection=None"`

optimizer::constant_folding::tests::optimize_plan_and_expr

  left: `"Projection: #test.a\n  Filter: NOT #test.b AND #test.c AS test.b != Boolean(true) AND test.c = Boolean(true)\n    TableScan: test projection=None"`,               
 right: `"Projection: #test.a\n  Filter: NOT #test.b AND #test.c\n    TableScan: test projection=None"`

optimizer::constant_folding::tests::optimize_plan_not_expr

  left: `"Projection: #test.a\n  Filter: #test.b AS NOT test.b = Boolean(false)\n    TableScan: test projection=None"`,                                                      
 right: `"Projection: #test.a\n  Filter: #test.b\n    TableScan: test projection=None"`

That's why I limited it to the nodes where I think we most need aliases to deal with the projection push down rule.

If you think this is okay, I can make it apply to all nodes and update these tests.

viirya · 2021-11-19T22:16:53Z

The approach of looking at field names is similar. Some tests need to be updated, e.g.

optimizer::constant_folding::tests::optimize_plan_support_projectionorg_expr

  left: `"Projection: #test.a, #test.d, NOT #test.b AS test.b = Boolean(false)\n  TableScan: test projection=None"`,                                     
 right: `"Projection: #test.a, #test.d, NOT #test.b\n  TableScan: test projection=None"`

optimizer::constant_folding::tests::optimize_plan_or_expr

  left: `"Projection: #test.a\n  Filter: NOT #test.b OR NOT #test.c AS a\n    TableScan: test projection=None"`,                                                             
 right: `"Projection: #test.a\n  Filter: NOT #test.b OR NOT #test.c\n    TableScan: test projection=None"`

optimizer::constant_folding::tests::optimize_plan_not_exp

  left: `"Projection: #test.a\n  Filter: #test.b AS a\n    TableScan: test projection=None"`,                                                                                
 right: `"Projection: #test.a\n  Filter: #test.b\n    TableScan: test projection=None"`

optimizer::constant_folding::tests::to_timestamp_expr_folded

  left: `"Projection: TimestampNanosecond(1599566400000000000)\n  TableScan: test projection=None"`,                                                                         
 right: `"Projection: TimestampNanosecond(1599566400000000000) AS totimestamp(Utf8(\"2020-09-08T12:00:00+00:00\"))\n  TableScan: test projection=None"`

viirya · 2021-11-19T22:18:42Z

If we inevitably need to update the tests with additional aliases, I'm not sure which one you prefer?

viirya · 2021-11-20T09:33:05Z

I've updated all affected tests. Now the aliasing is applied on all nodes. Please let me know if you think this is okay. Thanks.

houqp

Thank you @viirya , turns out we have not been honoring our own invariant: https://arrow.apache.org/datafusion/specification/invariants.html#logical-schema-is-invariant-under-logical-optimization :D

viirya · 2021-11-20T18:30:06Z

Thank you @houqp

alamb · 2021-11-22T21:48:39Z

datafusion/src/logical_plan/expr.rs

@@ -1349,6 +1349,15 @@ pub fn unnormalize_cols(exprs: impl IntoIterator<Item = Expr>) -> Vec<Expr> {
    exprs.into_iter().map(unnormalize_col).collect()
 }

+/// Recursively un-alias an expressions


The "recursively" part may be misleading, this function unwraps all current aliases

So an expr like (a as "foo") + (b as "bar") will not be unaliased, but an expr like (a as "foo") as "bar" will be unaliased to "foo"
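The distinction drawn in this review comment can be illustrated on a toy expression type. This sketch assumes the function strips every alias layer at the top of the expression and does not descend into child expressions; that is one reading of the comment, not a copy of the PR's code. Under that assumption `(a as "foo") as "bar"` comes back as the bare column `a`; a function that peels only one layer would return `a as "foo"` instead. The names `Expr`, `unalias`, `col`, and `alias` are hypothetical here.

```rust
// Toy expression type (an assumption for illustration, not DataFusion's `Expr`).
#[derive(Clone, Debug, PartialEq)]
enum Expr {
    Column(String),
    Add(Box<Expr>, Box<Expr>),
    Alias(Box<Expr>, String),
}

// Strips alias layers from the top of the expression only.
// The recursion follows nested aliases, not child expressions.
fn unalias(expr: Expr) -> Expr {
    match expr {
        Expr::Alias(inner, _) => unalias(*inner),
        other => other,
    }
}

// Small constructors to keep the examples readable.
fn col(name: &str) -> Expr {
    Expr::Column(name.to_string())
}

fn alias(e: Expr, name: &str) -> Expr {
    Expr::Alias(Box::new(e), name.to_string())
}

fn main() {
    // (a as "foo") + (b as "bar"): the top node is Add, so nothing is stripped.
    let sum = Expr::Add(
        Box::new(alias(col("a"), "foo")),
        Box::new(alias(col("b"), "bar")),
    );
    println!("{:?}", unalias(sum));

    // (a as "foo") as "bar": both top-level alias layers are stripped.
    let nested = alias(alias(col("a"), "foo"), "bar");
    println!("{:?}", unalias(nested));
}
```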

alamb

Thank you for sticking with this @viirya -- I think this is looking very good (and I agree the explain plan changes are improvements).

alamb · 2021-11-22T21:48:59Z

datafusion/src/optimizer/constant_folding.rs

@@ -92,6 +92,10 @@ impl OptimizerRule for ConstantFolding {
                    .expressions()
                    .into_iter()
                    .map(|e| {
+                        // We need to keep original expression name, if any.
+                        // Constant folding should not change expression name.


alamb · 2021-11-22T21:56:58Z

datafusion/src/optimizer/constant_folding.rs

+                                Ok(new_e)
+                            }
+                        } else {
+                            Ok(new_e)


I worry we may be silently ignoring some real issues in the future.

However, I tried checking expr_name and new_expr_name for errors and I got a bunch of errors like

---- execution::context::tests::window_partition_by stdout ----
Error: Internal("Create name does not support sort expression")
thread 'execution::context::tests::window_partition_by' panicked at 'assertion failed: `(left == right)`
  left: `1`,
 right: `0`: the test returned a termination value with a non-zero status code (1) which indicates a failure', /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/test/src/lib.rs:194:5
stack backtrace:
   0: rust_begin_unwind
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panicking.rs:517:5
   1: core::panicking::panic_fmt
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/panicking.rs:101:14
   2: core::panicking::assert_failed_inner
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/panicking.rs:177:23
   3: core::panicking::assert_failed
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/panicking.rs:140:5
   4: test::assert_test_result
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/test/src/lib.rs:194:5
   5: datafusion::execution::context::tests::window_partition_by::{{closure}}
             at ./src/execution/context.rs:1771:11
   6: core::ops::function::FnOnce::call_once
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/ops/function.rs:227:5
   7: core::ops::function::FnOnce::call_once
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/ops/function.rs:227:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

So I suppose this is as good as we are going to do for now

alamb · 2021-11-22T21:57:12Z

datafusion/src/optimizer/constant_folding.rs

@@ -626,8 +641,8 @@ mod tests {

        let expected = "\
        Projection: #test.a\
-        \n  Filter: NOT #test.c\
-        \n    Filter: #test.b\
+        \n  Filter: NOT #test.c AS test.c = Boolean(false)\


viirya · 2021-11-22T22:04:02Z

Thank you @alamb @Dandandan @houqp @capkurmagati !

Dandandan · 2021-11-26T17:20:22Z

I think this PR had an unintended side-effect on some other optimizers, like filter push down.
See #1367

Avoid changing output column name before pushdown projection.

a656b44

github-actions bot added the datafusion Changes in the datafusion crate label Nov 16, 2021

Avoid early optimization which is duplicate.

fe8445d

Modify test.

ebf67d3

alamb mentioned this pull request Nov 17, 2021

A problem about the projection_push_down optimizer gathers valid columns #1312

Closed

viirya added 5 commits November 17, 2021 10:05

Revert "Modify test."

8e52b94

This reverts commit ebf67d3.

Revert "Avoid early optimization which is duplicate."

3191563

This reverts commit fe8445d.

Add aliases during constant folding.

1cfbba0

Some expressions don't support name.

46315a1

Don't create redundant alias.

aa8cf15

viirya changed the title ~~Avoid changing output column name before pushdown projection~~ Avoid changing expression names during constant folding Nov 17, 2021

viirya added 2 commits November 17, 2021 16:46

Only add alias for certain plans.

7c22b5d

Fix clippy.

166b41a

Dandandan reviewed Nov 18, 2021


viirya added 2 commits November 18, 2021 18:52

Fix.

d767aeb

Revert "Fix."

8d715de

This reverts commit d767aeb.

viirya added 2 commits November 19, 2021 23:22

Apply to all nodes and update tests.

eec7cbe

Unalias when push donw to TableScan.

2a7652d

viirya force-pushed the issue_1316 branch from 575801b to 2a7652d Compare November 20, 2021 08:29

viirya added 2 commits November 20, 2021 00:33

Merge remote-tracking branch 'upstream/master' into issue_1316

22dac02

Update more tests.

546d2a2

Remove previous change.

6e53b53

houqp approved these changes Nov 20, 2021


houqp requested review from alamb and Dandandan November 22, 2021 03:53

houqp added the bug Something isn't working label Nov 22, 2021

alamb reviewed Nov 22, 2021


alamb approved these changes Nov 22, 2021


alamb merged commit 0df9b99 into apache:master Nov 22, 2021

Dandandan mentioned this pull request Nov 26, 2021

TPC-H q10 performance regression (expression for filter with added alias is not pushed down) #1367

Closed

alamb mentioned this pull request Nov 28, 2021

Consolidate ConstantFolding and SimplifyExpression #1375

Merged

jackwener mentioned this pull request Apr 10, 2022

Remove Alias from Expr #1468

Open
