Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid changing expression names during constant folding #1319

Merged
merged 17 commits into from
Nov 22, 2021

Conversation

viirya
Copy link
Member

@viirya viirya commented Nov 16, 2021

Which issue does this PR close?

Closes #1316.

Rationale for this change

This patch fixes a bug happened when users construct an aggregate function which has constants that can be folded.

As in projection pushdown rule, we check if an expression output (e.g., an aggregate function) is required by above plan (e.g. a projection) by comparing expression's name. But after constant folding rule, it is likely that the name of an aggregate function is changed (e.g., from COUNT(1 + 1) to COUNT(2)). Changed aggregate function is removed as we wrongly think it as unnecessary expression as the top projection requires #COUNT(1 + 1), not #COUNT(2).

What changes are included in this PR?

In ConstantFolding optimizer rules, keeping the original expression name unchanged.

DataFusion CLI v5.1.0-SNAPSHOT

❯ SELECT count(1 + 1);
+----------------------------+
| COUNT(Int64(1) + Int64(1)) |
+----------------------------+
| 1                          |
+----------------------------+
1 row in set. Query took 0.006 seconds.
❯ SELECT 1 + 1;
+---------------------+
| Int64(1) + Int64(1) |
+---------------------+
| 2                   |
+---------------------+
1 row in set. Query took 0.001 seconds.

Are there any user-facing changes?

No

@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Nov 16, 2021
@capkurmagati
Copy link
Contributor

Thanks @viirya.
I also thought it was the ConstantFolding optimizer leads to the bug when I filed the issue. But I couldn't digger further.
I tested the code locally and the test is passing but the cli still crashes for SELECT count(1 + 1).
I wonder if it works in your environment?

@viirya
Copy link
Member Author

viirya commented Nov 17, 2021

@capkurmagati Oh, I've not tested the cli before you asked. I just tested it and found another issue. It is because we do twice optimization on the logical plan, first time is when parsing sql into DataFrame, second time is when creating physical plan. Just updated with new change.

@viirya
Copy link
Member Author

viirya commented Nov 17, 2021

Hmm, there is a test ctx_sql_should_optimize_plan that assumes the logical plan after sql is optimized one. I thought it doesn't make sense as the logical plan is not a final one, e.g. you could do select etc. operations to produce new DataFrame (i.e., new logical plan). We only need to optimize it before creating physical plan (create_physical_plan).

@alamb
Copy link
Contributor

alamb commented Nov 17, 2021

Hmm, there is a test ctx_sql_should_optimize_plan that assumes the logical plan after sql is optimized one. I thought it doesn't make sense as the logical plan is not a final one, e.g. you could do select etc. operations to produce new DataFrame (i.e., new logical plan). We only need to optimize it before creating physical plan (create_physical_plan).

I wonder if another possible bug fix might be to change the ConstantEvaluation code to add an alias for aggregates? For example, rewrite aggregates so they explicitly keep the same (original) display name

Like for example, rewrite COUNT(1+1) to COUNT(2) as "COUNT(1+1)"

@alamb
Copy link
Contributor

alamb commented Nov 17, 2021

@viirya
Copy link
Member Author

viirya commented Nov 17, 2021

@alamb Thanks for looking into this. I've thought about adding aliases too. Just tried with the current simple approach to see if it works.

@Dandandan
Copy link
Contributor

Great find!

I agree the general fix suggested by @alamb (to add / keep the original alias) is what we should do in this case. In general, I think the order of the optimizers shouldn't have any influence on the outcome of the query - only on the performance.

@viirya viirya changed the title Avoid changing output column name before pushdown projection Avoid changing expression names during constant folding Nov 17, 2021
@viirya
Copy link
Member Author

viirya commented Nov 18, 2021

Also tested with the cli:

DataFusion CLI v5.1.0-SNAPSHOT

❯ SELECT count(1 + 1);
+----------------------------+
| COUNT(Int64(1) + Int64(1)) |
+----------------------------+
| 1                          |
+----------------------------+
1 row in set. Query took 0.006 seconds.
❯ 

// expression name for them.
let is_plan_for_projection_pushdown = matches!(
plan,
LogicalPlan::Window { .. }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why only those?

What about SELECT 1+1.

Currently this outputs:

❯ SELECT 1+1
;
+----------+
| Int64(2) |
+----------+
| 2        |
+----------+
1 row in set. Query took 0.001 seconds.

I would assume it should keep the Int64(1) + Int64(1) here instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, for Project, it will create many (looks redundant) aliases. Some looks okay but some looks really weird, e.g. some failed tests:

Projection: #test.a, #test.d, NOT #test.b AS test.b = Boolean(false)
  ...
Projection: Int32(0) AS CAST(Utf8(\"0\") AS Int32)
  ...

We have a lot tests that would be failed due to that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you tried following the model in https://github.com/apache/arrow-datafusion/pull/1315/files#diff-1d33be1a7e8231e53102eab8112e30aa89d8f5cb8c21cd25bcfbce3050cdb433R110 ? I think that calls columnize_expr among perhaps some other differences.

Basically I think the code needs to do something like walk over the field names in the output schema and if they names of the rewritten exprs weren't the same add an alias;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I agree with @Dandandan that this should apply to all nodes, not just a few special cased ones)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically I think the code needs to do something like walk over the field names in the output schema and if they names of the rewritten exprs weren't the same add an alias;

This sounds promising. No, I've not tried columnize_expr. Let me revise this and see if it works.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for trying @viirya -- I'll see if I can find some time this weekend to mess around with it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alamb . I'll keep trying on this too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, for Project, it will create many (looks redundant) aliases. Some looks okay but some looks really weird, e.g. some failed tests:

@viirya the example you gave here looks like correct behavior to me, are you concerned with lots of updates on the tests? or are there other unwanted side effect of this approach?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I'm simply unsure if such changes are okay here as it looks like most queries will be affected (not about its results but the cosmetic one).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it looks good for you, I will update all the tests.

This reverts commit d767aeb.
@viirya
Copy link
Member Author

viirya commented Nov 19, 2021

Note that for the current approach that compares expression names, it works. The only issue is if we apply to all nodes there are many tests needed to be updated because many aliases are to be added there. E.g.,

  1. optimizer::constant_folding::tests::optimize_plan_support_projection
  left: `"Projection: #test.a, #test.d, NOT #test.b AS test.b = Boolean(false)\n  TableScan: test projection=None"`,                                                         
 right: `"Projection: #test.a, #test.d, NOT #test.b\n  TableScan: test projection=None"`                                                 
  1. optimizer::constant_folding::tests::optimize_plan_and_expr
  left: `"Projection: #test.a\n  Filter: NOT #test.b AND #test.c AS test.b != Boolean(true) AND test.c = Boolean(true)\n    TableScan: test projection=None"`,               
 right: `"Projection: #test.a\n  Filter: NOT #test.b AND #test.c\n    TableScan: test projection=None"`
  1. optimizer::constant_folding::tests::optimize_plan_not_expr
  left: `"Projection: #test.a\n  Filter: #test.b AS NOT test.b = Boolean(false)\n    TableScan: test projection=None"`,                                                      
 right: `"Projection: #test.a\n  Filter: #test.b\n    TableScan: test projection=None"`'

That's why I limit to certain nodes that I think mostly we like to add aliases to deal with projection push down rule.

If you think this is okay, I can make it apply to all nodes and update these tests.

@viirya
Copy link
Member Author

viirya commented Nov 19, 2021

For the approach of looking field name, it is similar. Some tests are needed to update, e.g.

  1. optimizer::constant_folding::tests::optimize_plan_support_projectionorg_expr
  left: `"Projection: #test.a, #test.d, NOT #test.b AS test.b = Boolean(false)\n  TableScan: test projection=None"`,                                     
 right: `"Projection: #test.a, #test.d, NOT #test.b\n  TableScan: test projection=None"`
  1. optimizer::constant_folding::tests::optimize_plan_or_expr
  left: `"Projection: #test.a\n  Filter: NOT #test.b OR NOT #test.c AS a\n    TableScan: test projection=None"`,                                                             
 right: `"Projection: #test.a\n  Filter: NOT #test.b OR NOT #test.c\n    TableScan: test projection=None"`
  1. optimizer::constant_folding::tests::optimize_plan_not_exp
  left: `"Projection: #test.a\n  Filter: #test.b AS a\n    TableScan: test projection=None"`,                                                                                
 right: `"Projection: #test.a\n  Filter: #test.b\n    TableScan: test projection=None"`
  1. optimizer::constant_folding::tests::to_timestamp_expr_folded
  left: `"Projection: TimestampNanosecond(1599566400000000000)\n  TableScan: test projection=None"`,                                                                         
 right: `"Projection: TimestampNanosecond(1599566400000000000) AS totimestamp(Utf8(\"2020-09-08T12:00:00+00:00\"))\n  TableScan: test projection=None"`

@viirya
Copy link
Member Author

viirya commented Nov 19, 2021

If we inevitably need to update the tests with additional aliases, I'm not sure which one you prefer?

@viirya
Copy link
Member Author

viirya commented Nov 20, 2021

I've updated all affected tests. Now the aliasing is applied on all nodes. Please let me know if you think this is okay. Thanks.

Copy link
Member

@houqp houqp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viirya
Copy link
Member Author

viirya commented Nov 20, 2021

Thank you @houqp

@houqp houqp requested review from alamb and Dandandan November 22, 2021 03:53
@houqp houqp added the bug Something isn't working label Nov 22, 2021
@@ -1349,6 +1349,15 @@ pub fn unnormalize_cols(exprs: impl IntoIterator<Item = Expr>) -> Vec<Expr> {
exprs.into_iter().map(unnormalize_col).collect()
}

/// Recursively un-alias an expressions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "recursively" part may be misleading, this function unwraps all current aliases

So an expr like (a as "foo") + (b as "bar") will not be unaliased, but an expr like (a as "foo") as "bar" will be unaliased to "foo"

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for sticking with this @viirya -- I think this is looking very good (and I agree the explain plan changes are improvements).

@@ -92,6 +92,10 @@ impl OptimizerRule for ConstantFolding {
.expressions()
.into_iter()
.map(|e| {
// We need to keep original expression name, if any.
// Constant folding should not change expression name.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Ok(new_e)
}
} else {
Ok(new_e)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I worry we may be silently ignoring some real issues in the future.

However, I tried checking expr_name and new_expr_name for errors and I got a bunch of errors like

---- execution::context::tests::window_partition_by stdout ----
Error: Internal("Create name does not support sort expression")
thread 'execution::context::tests::window_partition_by' panicked at 'assertion failed: `(left == right)`
  left: `1`,
 right: `0`: the test returned a termination value with a non-zero status code (1) which indicates a failure', /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/test/src/lib.rs:194:5
stack backtrace:
   0: rust_begin_unwind
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panicking.rs:517:5
   1: core::panicking::panic_fmt
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/panicking.rs:101:14
   2: core::panicking::assert_failed_inner
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/panicking.rs:177:23
   3: core::panicking::assert_failed
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/panicking.rs:140:5
   4: test::assert_test_result
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/test/src/lib.rs:194:5
   5: datafusion::execution::context::tests::window_partition_by::{{closure}}
             at ./src/execution/context.rs:1771:11
   6: core::ops::function::FnOnce::call_once
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/ops/function.rs:227:5
   7: core::ops::function::FnOnce::call_once
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/ops/function.rs:227:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

So I suppose this is as good as we are going to do for now

@@ -626,8 +641,8 @@ mod tests {

let expected = "\
Projection: #test.a\
\n Filter: NOT #test.c\
\n Filter: #test.b\
\n Filter: NOT #test.c AS test.c = Boolean(false)\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alamb alamb merged commit 0df9b99 into apache:master Nov 22, 2021
@viirya
Copy link
Member Author

viirya commented Nov 22, 2021

Thank you @alamb @Dandandan @houqp @capkurmagati !

@Dandandan
Copy link
Contributor

I think this PR had an unintended side-effect on some other optimizers, like filter push down.
See #1367

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working datafusion Changes in the datafusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Aggregate function with expr panics when from table is absent
5 participants