-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mismatch in MemTable of Select Into when projecting on aggregate window functions #6566
Changes from 2 commits
15903d4
19bc990
c19381a
b2804fa
2504a80
b6a5ee4
7f2c641
a27ac77
f9ab83f
22fac6e
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -518,7 +518,7 @@ impl SessionContext { | |
let physical = DataFrame::new(self.state(), input); | ||
|
||
let batches: Vec<_> = physical.collect_partitioned().await?; | ||
let table = Arc::new(MemTable::try_new(schema, batches)?); | ||
let table = Arc::new(MemTable::new_not_registered(schema, batches)); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It seems like the core problem here is As I think @comphead is suggesting in #6566 (review), this PR seems to be trying to workaround the deeper problem of the window exec producing an output schema (different column names) than the LogicalPlan says. Have you looked into making the names consistent? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, my first suggestion is to use longer version of the name in planner (what is done in the LogicalPlan). However, there need to be many changes in tests, and the BoundedWindowAggExec lines may become too long. If it is not a problem, we can solve the issue like that There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think using the same names in physical and logical plans is preferable because the rest of the parts of the code expects this and sometimes makes assumptions that it is the case (because it mostly is). If we don't make the logical and physical plans match up, I predict we will continue to hit a long tail of bugs related to schema mismatches, only when using window functions related to the discrepancy. If the long display name is a problem (and I can see how it would be) perhaps we can figure out how to make Here is what postgres does: postgres=# select first_value(x) over (order by x) from foo;
first_value
-------------
1
(1 row) We probably need to do something more sophisticated as DataFusion needs distinct column names. |
||
|
||
self.register_table(&name, table)?; | ||
self.return_empty_dataframe() | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this code effectively ignores the
schema
argument if any ofpartitions
has a RecordBatch and uses that RecordBatch's schema instead.I think this is a surprising behavior and might mask errors in the future