-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: support arbitrary expressions in LIMIT
plan
#13028
Conversation
@@ -623,8 +623,8 @@ pub async fn from_substrait_rel( | |||
from_substrait_rel(ctx, input, extensions).await?, | |||
); | |||
let offset = fetch.offset as usize; | |||
// Since protobuf can't directly distinguish `None` vs `0` `None` is encoded as `MAX` | |||
let count = if fetch.count as usize == usize::MAX { | |||
// -1 means that ALL records should be returned |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the substrait spec, we should use -1
here.
.. | ||
}) => { | ||
LogicalPlan::Limit(limit) => { | ||
// Attempt to display `skip` and `fetch` as literals if possible, otherwise as expressions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Display literals as before to avoid breaking too many tests. Maybe we could display them in the expr-style through a follow-up PR. For example, 1
-> Int64(1)
.
@@ -2799,14 +2817,71 @@ impl PartialOrd for Extension { | |||
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Hash)] | |||
pub struct Limit { | |||
/// Number of rows to skip before fetch | |||
pub skip: usize, | |||
pub skip: Option<Box<Expr>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use Box
here to prevent increasing the size of LogicalPlan
, that isstd::mem::size_of::<LogicalPlan>()
; otherwise it will cause a stack overflow in one of the array_ndims test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jonahgao it makes sense to me. 👍 Only some minor suggestions.
Co-authored-by: Jax Liu <[email protected]>
Co-authored-by: Jax Liu <[email protected]>
@@ -2799,14 +2817,71 @@ impl PartialOrd for Extension { | |||
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Hash)] | |||
pub struct Limit { | |||
/// Number of rows to skip before fetch | |||
pub skip: usize, | |||
pub skip: Option<Box<Expr>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are the constraints on the expression that can be used here?
For example, can it have any column references?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to do the constant folding when building Limit node, so that logical plan structure remains intact? See also #12723
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are the constraints on the expression that can be used here? For example, can it have any column references?
I think it can be any integer expression, and can also contain column references. Both PostgreSQL and DuckDB support select 1 limit (select max(col0) from t)
.
v1.1.1-dev319 af39bd0dcf
D create table t as values(1);
D select 1 limit (select max(col0) from t);
┌───────┐
│ 1 │
│ int32 │
├───────┤
│ 1 │
└───────┘
/// Currently only supports expressions that can be folded into constants. | ||
UnsupportedExpr, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Limit is a relational operator, so this will always need to be constant-foldable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but some expressions like limit (select count(*) from t)
can't be folded at the planning stage , so we need to keep them for conversion into physical expressions later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jonahgao good point. I also thought about this, but ignored, assuming we don't plan to support this.
We could have Limit
node as-is for now (and do trivial constant folding when building the plan), and introduce expressions in the plan when we add support for limit (<uncorrelated subquery>)
. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Constant folding requires the TypeCoercion
and SimplifyExpressions
rule. We can't directly invoke them during building for now and need to defer them to optimizer.
Another reason the Limit node needs to contain expressions is to support Prepare statements. Issue #12294 requires the Limit node to use Expr::Placeholder
.
Thank you. I have some suggestions for Before this PR, only int literals are accepted DataFusion CLI v42.0.0
> create table t1(v1 int);
0 row(s) fetched.
Elapsed 0.052 seconds.
> insert into t1 values (1),(2),(3);
+-------+
| count |
+-------+
| 3 |
+-------+
1 row(s) fetched.
Elapsed 0.086 seconds.
> select * from t1 limit 1.5;
Error during planning: Unexpected expression in LIMIT clause
> select * from t1 limit true;
Error during planning: Unexpected expression in LIMIT clause This PR: > select * from t1 limit 1.5;
+----+
| v1 |
+----+
| 1 |
+----+
1 row(s) fetched.
Elapsed 0.006 seconds.
> select * from t1 offset 0.5 limit 10;
+----+
| v1 |
+----+
| 1 |
| 2 |
| 3 |
+----+
3 row(s) fetched.
Elapsed 0.004 seconds.
> select * from t1 limit false;
+----+
| v1 |
+----+
+----+
0 row(s) fetched.
Elapsed 0.009 seconds. Postgres's behavior is: if postgres=# select * from t1 limit 1.5;
v1
----
1
2
(2 rows)
postgres=# select * from t1 offset 0.5 limit 10;
v1
----
2
3
(2 rows)
postgres=# select * from t1 limit false;
ERROR: argument of LIMIT must be type bigint, not type boolean
LINE 1: select * from t1 limit false;
|
DuckDB supports them, but I think we can follow PG for now until someone requests these features. |
@2010YOUY01 good point! this looks like a pandora box. strictly speaking, casting them to integral values isn't exactly what we can do: |
Not really, allowing fractional values looks quite bizarre to me, however, nowadays many SQL are auto-generated so we can't make too many assumptions
This way it requires lots of effort to match PostgreSQL's behavior 🤔 Maybe it should be left to a later task |
Disallow non-integer types in e6d1297 . Fractional values are more complicated than I expected, let's handle them later. @2010YOUY01 @findepi |
I think the floating value is caused by the DataFusion casting behavior is different from others. I tried to cast a float to an integer. DataFusion
Postgres
DuckDB
Postgres and DuckDB are rounding but DataFusion is round down. So, that's why |
|
If i were generating queries, i wouldn't ask a database to do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @jonahgao and @findepi and @goldmedal - this looks like a good change to me
let input = children.one()?; | ||
let SkipType::Literal(skip) = limit.get_skip_type()? else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Thanks @goldmedal @findepi @2010YOUY01 @alamb for the review. |
Which issue does this PR close?
Closes #9821.
Rationale for this change
SKIP
andFETCH
in theLIMIT
plan can support arbitrary expressions.This olny extends the
LogicalPlan
ofLIMIT
. It relies onSimplifyExpressions
to fold expressions into constants; if it can't, it will skip executing certain optimization rules, and report an error when creating the physical plan.What changes are included in this PR?
Are these changes tested?
Yes
Are there any user-facing changes?
Yes.