-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid changing expression names during constant folding #1319
Conversation
Thanks @viirya. |
@capkurmagati Oh, I've not tested the cli before you asked. I just tested it and found another issue. It is because we do twice optimization on the logical plan, first time is when parsing sql into DataFrame, second time is when creating physical plan. Just updated with new change. |
Hmm, there is a test |
I wonder if another possible bug fix might be to change the Like for example, rewrite |
I may be mistaken, but I think @ic4y added aliases in https://github.com/apache/arrow-datafusion/pull/1315/files#diff-1d33be1a7e8231e53102eab8112e30aa89d8f5cb8c21cd25bcfbce3050cdb433R110 |
@alamb Thanks for looking into this. I've thought about adding aliases too. Just tried with the current simple approach to see if it works. |
Great find! I agree the general fix suggested by @alamb (to add / keep the original alias) is what we should do in this case. In general, I think the order of the optimizers shouldn't have any influence on the outcome of the query - only on the performance. |
Also tested with the cli:
|
// expression name for them. | ||
let is_plan_for_projection_pushdown = matches!( | ||
plan, | ||
LogicalPlan::Window { .. } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why only those?
What about SELECT 1+1
.
Currently this outputs:
❯ SELECT 1+1
;
+----------+
| Int64(2) |
+----------+
| 2 |
+----------+
1 row in set. Query took 0.001 seconds.
I would assume it should keep the Int64(1) + Int64(1)
here instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, for Project, it will create many (looks redundant) aliases. Some looks okay but some looks really weird, e.g. some failed tests:
Projection: #test.a, #test.d, NOT #test.b AS test.b = Boolean(false)
...
Projection: Int32(0) AS CAST(Utf8(\"0\") AS Int32)
...
We have a lot tests that would be failed due to that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you tried following the model in https://github.com/apache/arrow-datafusion/pull/1315/files#diff-1d33be1a7e8231e53102eab8112e30aa89d8f5cb8c21cd25bcfbce3050cdb433R110 ? I think that calls columnize_expr
among perhaps some other differences.
Basically I think the code needs to do something like walk over the field names in the output schema
and if they names of the rewritten exprs weren't the same add an alias;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I agree with @Dandandan that this should apply to all nodes, not just a few special cased ones)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically I think the code needs to do something like walk over the field names in the output schema and if they names of the rewritten exprs weren't the same add an alias;
This sounds promising. No, I've not tried columnize_expr
. Let me revise this and see if it works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for trying @viirya -- I'll see if I can find some time this weekend to mess around with it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @alamb . I'll keep trying on this too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, for Project, it will create many (looks redundant) aliases. Some looks okay but some looks really weird, e.g. some failed tests:
@viirya the example you gave here looks like correct behavior to me, are you concerned with lots of updates on the tests? or are there other unwanted side effect of this approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I'm simply unsure if such changes are okay here as it looks like most queries will be affected (not about its results but the cosmetic one).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it looks good for you, I will update all the tests.
This reverts commit d767aeb.
Note that for the current approach that compares expression names, it works. The only issue is if we apply to all nodes there are many tests needed to be updated because many aliases are to be added there. E.g.,
That's why I limit to certain nodes that I think mostly we like to add aliases to deal with projection push down rule. If you think this is okay, I can make it apply to all nodes and update these tests. |
For the approach of looking field name, it is similar. Some tests are needed to update, e.g.
|
If we inevitably need to update the tests with additional aliases, I'm not sure which one you prefer? |
I've updated all affected tests. Now the aliasing is applied on all nodes. Please let me know if you think this is okay. Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @viirya , turns out we have not been honoring our own invariant: https://arrow.apache.org/datafusion/specification/invariants.html#logical-schema-is-invariant-under-logical-optimization :D
Thank you @houqp |
@@ -1349,6 +1349,15 @@ pub fn unnormalize_cols(exprs: impl IntoIterator<Item = Expr>) -> Vec<Expr> { | |||
exprs.into_iter().map(unnormalize_col).collect() | |||
} | |||
|
|||
/// Recursively un-alias an expressions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "recursively" part may be misleading, this function unwraps all current aliases
So an expr like (a as "foo") + (b as "bar")
will not be unaliased, but an expr like (a as "foo") as "bar"
will be unaliased to "foo"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for sticking with this @viirya -- I think this is looking very good (and I agree the explain plan changes are improvements).
@@ -92,6 +92,10 @@ impl OptimizerRule for ConstantFolding { | |||
.expressions() | |||
.into_iter() | |||
.map(|e| { | |||
// We need to keep original expression name, if any. | |||
// Constant folding should not change expression name. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Ok(new_e) | ||
} | ||
} else { | ||
Ok(new_e) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I worry we may be silently ignoring some real issues in the future.
However, I tried checking expr_name
and new_expr_name
for errors and I got a bunch of errors like
---- execution::context::tests::window_partition_by stdout ----
Error: Internal("Create name does not support sort expression")
thread 'execution::context::tests::window_partition_by' panicked at 'assertion failed: `(left == right)`
left: `1`,
right: `0`: the test returned a termination value with a non-zero status code (1) which indicates a failure', /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/test/src/lib.rs:194:5
stack backtrace:
0: rust_begin_unwind
at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panicking.rs:517:5
1: core::panicking::panic_fmt
at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/panicking.rs:101:14
2: core::panicking::assert_failed_inner
at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/panicking.rs:177:23
3: core::panicking::assert_failed
at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/panicking.rs:140:5
4: test::assert_test_result
at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/test/src/lib.rs:194:5
5: datafusion::execution::context::tests::window_partition_by::{{closure}}
at ./src/execution/context.rs:1771:11
6: core::ops::function::FnOnce::call_once
at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/ops/function.rs:227:5
7: core::ops::function::FnOnce::call_once
at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/ops/function.rs:227:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
So I suppose this is as good as we are going to do for now
@@ -626,8 +641,8 @@ mod tests { | |||
|
|||
let expected = "\ | |||
Projection: #test.a\ | |||
\n Filter: NOT #test.c\ | |||
\n Filter: #test.b\ | |||
\n Filter: NOT #test.c AS test.c = Boolean(false)\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Thank you @alamb @Dandandan @houqp @capkurmagati ! |
I think this PR had an unintended side-effect on some other optimizers, like filter push down. |
Which issue does this PR close?
Closes #1316.
Rationale for this change
This patch fixes a bug happened when users construct an aggregate function which has constants that can be folded.
As in projection pushdown rule, we check if an expression output (e.g., an aggregate function) is required by above plan (e.g. a projection) by comparing expression's name. But after constant folding rule, it is likely that the name of an aggregate function is changed (e.g., from
COUNT(1 + 1)
toCOUNT(2)
). Changed aggregate function is removed as we wrongly think it as unnecessary expression as the top projection requires#COUNT(1 + 1)
, not#COUNT(2)
.What changes are included in this PR?
In
ConstantFolding
optimizer rules, keeping the original expression name unchanged.Are there any user-facing changes?
No