Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support arbitrary expressions in LIMIT plan #13028

Merged
merged 10 commits into from
Oct 24, 2024

Conversation

jonahgao
Copy link
Member

Which issue does this PR close?

Closes #9821.

Rationale for this change

SKIP and FETCH in the LIMIT plan can support arbitrary expressions.

pub struct Limit {
    pub skip: Option<Box<Expr>>,
    pub fetch: Option<Box<Expr>>,
    pub input: Arc<LogicalPlan>,
}

This olny extends the LogicalPlan of LIMIT. It relies on SimplifyExpressions to fold expressions into constants; if it can't, it will skip executing certain optimization rules, and report an error when creating the physical plan.

What changes are included in this PR?

Are these changes tested?

Yes

Are there any user-facing changes?

Yes.

@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions optimizer Optimizer rules core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) substrait proto Related to proto crate labels Oct 21, 2024
@@ -623,8 +623,8 @@ pub async fn from_substrait_rel(
from_substrait_rel(ctx, input, extensions).await?,
);
let offset = fetch.offset as usize;
// Since protobuf can't directly distinguish `None` vs `0` `None` is encoded as `MAX`
let count = if fetch.count as usize == usize::MAX {
// -1 means that ALL records should be returned
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the substrait spec, we should use -1 here.

..
}) => {
LogicalPlan::Limit(limit) => {
// Attempt to display `skip` and `fetch` as literals if possible, otherwise as expressions.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Display literals as before to avoid breaking too many tests. Maybe we could display them in the expr-style through a follow-up PR. For example, 1 -> Int64(1).

@@ -2799,14 +2817,71 @@ impl PartialOrd for Extension {
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Hash)]
pub struct Limit {
/// Number of rows to skip before fetch
pub skip: usize,
pub skip: Option<Box<Expr>>,
Copy link
Member Author

@jonahgao jonahgao Oct 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use Box here to prevent increasing the size of LogicalPlan, that isstd::mem::size_of::<LogicalPlan>(); otherwise it will cause a stack overflow in one of the array_ndims test

@jonahgao jonahgao added the api change Changes the API exposed to users of the crate label Oct 21, 2024
Copy link
Contributor

@goldmedal goldmedal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jonahgao it makes sense to me. 👍 Only some minor suggestions.

datafusion/optimizer/src/eliminate_limit.rs Outdated Show resolved Hide resolved
datafusion/expr/src/expr_rewriter/mod.rs Outdated Show resolved Hide resolved
@@ -2799,14 +2817,71 @@ impl PartialOrd for Extension {
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Hash)]
pub struct Limit {
/// Number of rows to skip before fetch
pub skip: usize,
pub skip: Option<Box<Expr>>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the constraints on the expression that can be used here?
For example, can it have any column references?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to do the constant folding when building Limit node, so that logical plan structure remains intact? See also #12723

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the constraints on the expression that can be used here? For example, can it have any column references?

I think it can be any integer expression, and can also contain column references. Both PostgreSQL and DuckDB support select 1 limit (select max(col0) from t).

v1.1.1-dev319 af39bd0dcf
D create table t as values(1);
D select 1 limit (select max(col0) from t);
┌───────┐
│   1   │
│ int32 │
├───────┤
│     1 │
└───────┘

Comment on lines +2832 to +2833
/// Currently only supports expressions that can be folded into constants.
UnsupportedExpr,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Limit is a relational operator, so this will always need to be constant-foldable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but some expressions like limit (select count(*) from t) can't be folded at the planning stage , so we need to keep them for conversion into physical expressions later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jonahgao good point. I also thought about this, but ignored, assuming we don't plan to support this.

We could have Limit node as-is for now (and do trivial constant folding when building the plan), and introduce expressions in the plan when we add support for limit (<uncorrelated subquery>). WDYT?

Copy link
Member Author

@jonahgao jonahgao Oct 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Constant folding requires the TypeCoercion and SimplifyExpressions rule. We can't directly invoke them during building for now and need to defer them to optimizer.

Another reason the Limit node needs to contain expressions is to support Prepare statements. Issue #12294 requires the Limit node to use Expr::Placeholder.

@2010YOUY01
Copy link
Contributor

Thank you. I have some suggestions for LIMIT's behavior:

Before this PR, only int literals are accepted

DataFusion CLI v42.0.0
> create table t1(v1 int);
0 row(s) fetched.
Elapsed 0.052 seconds.

> insert into t1 values (1),(2),(3);
+-------+
| count |
+-------+
| 3     |
+-------+
1 row(s) fetched.
Elapsed 0.086 seconds.

> select * from t1 limit 1.5;
Error during planning: Unexpected expression in LIMIT clause
> select * from t1 limit true;
Error during planning: Unexpected expression in LIMIT clause

This PR:

> select * from t1 limit 1.5;
+----+
| v1 |
+----+
| 1  |
+----+
1 row(s) fetched.
Elapsed 0.006 seconds.

> select * from t1 offset 0.5 limit 10;
+----+
| v1 |
+----+
| 1  |
| 2  |
| 3  |
+----+
3 row(s) fetched.
Elapsed 0.004 seconds.

> select * from t1 limit false;
+----+
| v1 |
+----+
+----+
0 row(s) fetched.
Elapsed 0.009 seconds.

Postgres's behavior is: if OFFSET is not integer for example 0.5, 0th row won't be included, if LIMIT is 0.5, then 0th row would be included:

postgres=# select * from t1 limit 1.5;
 v1 
----
  1
  2
(2 rows)

postgres=# select * from t1 offset 0.5 limit 10;
 v1 
----
  2
  3
(2 rows)

postgres=# select * from t1 limit false;
ERROR:  argument of LIMIT must be type bigint, not type boolean
LINE 1: select * from t1 limit false;
  1. I think it's good to follow Postgres' behavior for non-integer arguments
  2. Don't support non-numeric arguments like bool, I think they are allowed to cast to integer, but not meaningful in SQL query (maybe there are other types that have a similar issue I'm not aware of)

@jonahgao
Copy link
Member Author

  1. I think it's good to follow Postgres' behavior for non-integer arguments
  2. Don't support non-numeric arguments like bool, I think they are allowed to cast to integer, but not meaningful in SQL query (maybe there are other types that have a similar issue I'm not aware of)

DuckDB supports them, but I think we can follow PG for now until someone requests these features.

@findepi
Copy link
Member

findepi commented Oct 22, 2024

Postgres's behavior is: if OFFSET is not integer for example 0.5, 0th row won't be included, if LIMIT is 0.5, then 0th row would be included:

@2010YOUY01 good point!
do you maybe know what is the use-case for allowing fractional values for LIMIT or OFFSET?

this looks like a pandora box. strictly speaking, casting them to integral values isn't exactly what we can do: OFFSET 1.5 LIMIT 1.5 should return 3rd row, but not the 4th row.

@2010YOUY01
Copy link
Contributor

@2010YOUY01 good point! do you maybe know what is the use-case for allowing fractional values for LIMIT or OFFSET?

Not really, allowing fractional values looks quite bizarre to me, however, nowadays many SQL are auto-generated so we can't make too many assumptions

this looks like a pandora box. strictly speaking, casting them to integral values isn't exactly what we can do: OFFSET 1.5 LIMIT 1.5 should return 3rd row, but not the 4th row.

This way it requires lots of effort to match PostgreSQL's behavior 🤔 Maybe it should be left to a later task

@jonahgao
Copy link
Member Author

Disallow non-integer types in e6d1297 . Fractional values are more complicated than I expected, let's handle them later. @2010YOUY01 @findepi

@goldmedal
Copy link
Contributor

Disallow non-integer types in e6d1297 . Fractional values are more complicated than I expected, let's handle them later. @2010YOUY01 @findepi

I think the floating value is caused by the DataFusion casting behavior is different from others. I tried to cast a float to an integer.

DataFusion

DataFusion CLI v42.0.0
> select 1.5::int;
+--------------+
| Float64(1.5) |
+--------------+
| 1            |
+--------------+
1 row(s) fetched. 
Elapsed 0.055 seconds.

> select 1.4::int;
+--------------+
| Float64(1.4) |
+--------------+
| 1            |
+--------------+
1 row(s) fetched. 
Elapsed 0.001 seconds.

Postgres

test=# select 1.5::int;
 int4 
------
    2
(1 row)

test=# select 1.4::int;
 int4 
------
    1
(1 row)

DuckDB

D select 1.5::int;
┌──────────────────────┐
│ CAST(1.5 AS INTEGER) │
│        int32         │
├──────────────────────┤
│                    2 │
└──────────────────────┘
D select 1.4::int;
┌──────────────────────┐
│ CAST(1.4 AS INTEGER) │
│        int32         │
├──────────────────────┤
│                    1 │
└─────

Postgres and DuckDB are rounding but DataFusion is round down.

So, that's why limit 1.5 only returns 1 row in DataFusion but returns 2 rows in Postgres.

@goldmedal
Copy link
Contributor

Disallow non-integer types in e6d1297 . Fractional values are more complicated than I expected, let's handle them later. @2010YOUY01 @findepi

I think the floating value is caused by the DataFusion casting behavior is different from others. I tried to cast a float to an integer.

Maybe related to #8302 and #11886

@jonahgao
Copy link
Member Author

I think the floating value is caused by the DataFusion casting behavior is different from others. I tried to cast a float to an integer.

Maybe related to #8302 and #11886

Maybe we can add support for floats after these issues are resolved🤔.

@findepi
Copy link
Member

findepi commented Oct 23, 2024

Not really, allowing fractional values looks quite bizarre to me, however, nowadays many SQL are auto-generated so we can't make too many assumptions

If i were generating queries, i wouldn't ask a database to do LIMIT 1+2, I'd generate LIMIT 3.
Do you maybe know what class of software generates LIMIT <expr> or LIMIT <fractional>?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jonahgao and @findepi and @goldmedal - this looks like a good change to me

let input = children.one()?;
let SkipType::Literal(skip) = limit.get_skip_type()? else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@jonahgao jonahgao merged commit 3f3a0cf into apache:main Oct 24, 2024
24 checks passed
@jonahgao
Copy link
Member Author

Thanks @goldmedal @findepi @2010YOUY01 @alamb for the review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules proto Related to proto crate sql SQL Planner sqllogictest SQL Logic Tests (.slt) substrait
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support arbitrary expressions in LIMIT clause
5 participants