fix: Limits are not applied correctly #14418

zhuqi-lucas · 2025-02-03T06:48:16Z

Which issue does this PR close?

Rationale for this change

Fix the behaviour for limit with CoalescePartitionsExec.

CoalescePartitionsExec merges all partitions into one, but each partition only applies its own local limit, so we should not remove the global limit before CoalescePartitionsExec.
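
To illustrate the rationale, here is a minimal sketch in plain Rust (not DataFusion code) of why local limits alone are not enough. The functions `local_limit` and `coalesce` are hypothetical stand-ins for a per-partition limit and for CoalescePartitionsExec, and partitions are modeled as vectors of integers.

```rust
/// Hypothetical stand-in for a per-partition (local) limit:
/// keeps at most `fetch` rows of every partition independently.
fn local_limit(partitions: &[Vec<i32>], fetch: usize) -> Vec<Vec<i32>> {
    partitions
        .iter()
        .map(|p| p.iter().copied().take(fetch).collect())
        .collect()
}

/// Hypothetical stand-in for CoalescePartitionsExec:
/// concatenates all partitions into a single one.
fn coalesce(partitions: &[Vec<i32>]) -> Vec<i32> {
    partitions.iter().flatten().copied().collect()
}

fn main() {
    // Two partitions, e.g. two parquet files.
    let partitions = vec![vec![10, 11, 12], vec![20, 21]];

    // Only local limits: every partition contributes one row.
    let without_global = coalesce(&local_limit(&partitions, 1));
    println!("local limits only: {} row(s)", without_global.len());

    // Local limits plus a global limit on top of the merged stream.
    let with_global: Vec<i32> = without_global.iter().copied().take(1).collect();
    println!("with global limit: {} row(s)", with_global.len());
}
```

With two partitions and `limit 1`, the first case yields two rows, which matches the shape of the wrong result reported in this PR; the second case yields one.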

What changes are included in this PR?

Fix the behaviour for limit with CoalescePartitionsExec.

Are these changes tested?

Yes, slt testing added.

Are there any user-facing changes?

It fixes the user-facing issue.

Before this PR:

with selection as (
    select *
    from 'parquet_files/*'
    limit 1
)
select 1 as foo
from selection
order by duration
limit 1000;

I get:

+-----+
| foo |
+-----+
| 1   |
| 1   |
+-----+
2 row(s) fetched.

zhuqi-lucas · 2025-02-03T07:10:32Z

error: use of deprecated constant `arrow::datatypes::MAX_DECIMAL_FOR_EACH_PRECISION`: Use MAX_DECIMAL128_FOR_EACH_PRECISION (note indexes are different)
  --> datafusion/optimizer/src/unwrap_cast_in_comparison.rs:29:25
   |
29 |     DataType, TimeUnit, MAX_DECIMAL_FOR_EACH_PRECISION, MIN_DECIMAL_FOR_EACH_PRECISION,
   |                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   |
   = note: `-D deprecated` implied by `-D warnings`
   = help: to override `-D warnings` add `#[allow(deprecated)]`

error: use of deprecated constant `arrow::datatypes::MIN_DECIMAL_FOR_EACH_PRECISION`: Use MIN_DECIMAL_FOR_EACH_PRECISION (note indexes are different)
  --> datafusion/optimizer/src/unwrap_cast_in_comparison.rs:29:57
   |
29 |     DataType, TimeUnit, MAX_DECIMAL_FOR_EACH_PRECISION, MIN_DECIMAL_FOR_EACH_PRECISION,
   |                                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

error: use of deprecated constant `arrow::datatypes::MIN_DECIMAL_FOR_EACH_PRECISION`: Use MIN_DECIMAL_FOR_EACH_PRECISION (note indexes are different)
   --> datafusion/optimizer/src/unwrap_cast_in_comparison.rs:372:13
    |
372 |             MIN_DECIMAL_FOR_EACH_PRECISION[*precision as usize - 1],
    |             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

error: use of deprecated constant `arrow::datatypes::MAX_DECIMAL_FOR_EACH_PRECISION`: Use MAX_DECIMAL128_FOR_EACH_PRECISION (note indexes are different)
   --> datafusion/optimizer/src/unwrap_cast_in_comparison.rs:373:13
    |
373 |             MAX_DECIMAL_FOR_EACH_PRECISION[*precision as usize - 1],
    |             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The clippy complaint is not related to this PR. It will be fixed after:

#14414

zhuqi-lucas · 2025-02-03T07:11:07Z

cc @alamb @adriangb
The PR is ready for review.

adriangb · 2025-02-03T14:20:54Z

Tests look good to me. I have not touched much of the optimizer, so I'm not confident reviewing the change itself.

Your explanation is great, I think it is important to record for posterity what the source cause was.
One thing that still doesn't make sense to me is why this only happens with an ORDER BY?

ozankabak · 2025-02-03T14:33:55Z

Thanks for the patch. We may be able to do this without checking for specific operators (like CoalescePartitionsExec), but by using the APIs in the execution plan. @mertak-synnada, can you please review?

zhuqi-lucas · 2025-02-03T15:09:06Z

Thank you @adriangb for the review. This only happens when sorting with a limit, because of the following logic:

    // If we have a non-limit operator with fetch capability, update global
    // state as necessary:
    if pushdown_plan.fetch().is_some() {
        if global_state.fetch.is_none() {
            global_state.satisfied = true;
        }
        (global_state.skip, global_state.fetch) = combine_limit(
            global_state.skip,
            global_state.fetch,
            0,
            pushdown_plan.fetch(),
        );
    }
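
As a rough illustration of the combining step, the sketch below models how a tracked (skip, fetch) pair could be merged with the fetch of an operator further down. It is a simplified, hypothetical `combine_limit_sketch`, not the actual `combine_limit` from DataFusion; the argument order (parent skip, parent fetch, child skip, child fetch) is an assumption taken from the call in the snippet above.

```rust
use std::cmp::min;

/// Simplified, hypothetical model of combining a parent (skip, fetch)
/// with a child (skip, fetch). Not the DataFusion implementation.
fn combine_limit_sketch(
    parent_skip: usize,
    parent_fetch: Option<usize>,
    child_skip: usize,
    child_fetch: Option<usize>,
) -> (usize, Option<usize>) {
    // Rows skipped by the child and by the parent add up.
    let skip = child_skip.saturating_add(parent_skip);
    let fetch = match (parent_fetch, child_fetch) {
        // The parent skips rows out of what the child produces, so the child
        // can deliver at most `child_fetch - parent_skip` rows to the parent.
        (Some(p), Some(c)) => Some(min(p, c.saturating_sub(parent_skip))),
        (Some(p), None) => Some(p),
        (None, Some(c)) => Some(c.saturating_sub(parent_skip)),
        (None, None) => None,
    };
    (skip, fetch)
}

fn main() {
    // A tracked fetch of 1000 meets an operator that already fetches 1 row.
    println!("{:?}", combine_limit_sketch(0, Some(1000), 0, Some(1)));
    // No fetch tracked yet: the operator's own fetch is taken over as is.
    println!("{:?}", combine_limit_sketch(0, None, 0, Some(1)));
}
```

In both calls the smaller fetch wins, so the combined state ends up with a fetch of 1 and no skip.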

When sorting with a limit, the following steps cause the bug:

global_state.satisfied is set to true.
Without this PR, the original logic then removes the global limit.
When we reach the decision for a plan that supports limit pushdown (CoalescePartitionsExec supports it since "Add LimitPushdown optimization rule and CoalesceBatchesExec fetch" #11652), we do not add the global limit back before CoalescePartitionsExec, because global_state.satisfied has already been set to true.

    if pushdown_plan.supports_limit_pushdown() {
        if !combines_input_partitions(&pushdown_plan) {
            // We have information in the global state and the plan pushes down,
            // continue:
            Ok((Transformed::no(pushdown_plan), global_state))
        } else if let Some(plan_with_fetch) = pushdown_plan.with_fetch(skip_and_fetch) {
            // This plan is combining input partitions, so we need to add the
            // fetch info to plan if possible. If not, we must add a `LimitExec`
            // with the information from the global state.
            let mut new_plan = plan_with_fetch;
            // Execution plans can't (yet) handle skip, so if we have one,
            // we still need to add a global limit
            if global_state.skip > 0 {
                new_plan =
                    add_global_limit(new_plan, global_state.skip, global_state.fetch);
            }
            global_state.fetch = skip_and_fetch;
            global_state.skip = 0;
            global_state.satisfied = true;
            Ok((Transformed::yes(new_plan), global_state))
        } else if global_state.satisfied {
            // If the plan is already satisfied, do not add a limit:
            Ok((Transformed::no(pushdown_plan), global_state))
        } else {
            global_state.satisfied = true;
            Ok((
                Transformed::yes(add_limit(
                    pushdown_plan,
                    global_state.skip,
                    global_fetch,
                )),
                global_state,
            ))
        }
    }
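
The branch above can be condensed into a small decision function. This is a toy model only: `Node`, `Action` and `decide` are hypothetical names, and the assumption that CoalescePartitionsExec has no with_fetch support is taken from the discussion in this thread rather than checked against the code base.

```rust
/// What the rule does with the limit information at a given plan node.
#[derive(Debug, PartialEq)]
enum Action {
    /// Keep pushing the limit further down.
    PushThrough,
    /// Store the fetch inside the operator itself.
    EmbedFetch,
    /// Leave the plan alone: no limit is added here.
    NoLimitAdded,
    /// Put a limit operator on top of this node.
    AddLimit,
}

/// Hypothetical description of a plan node that supports limit pushdown.
struct Node {
    combines_input_partitions: bool,
    supports_with_fetch: bool,
}

fn decide(node: &Node, satisfied: bool) -> Action {
    if !node.combines_input_partitions {
        Action::PushThrough
    } else if node.supports_with_fetch {
        Action::EmbedFetch
    } else if satisfied {
        Action::NoLimitAdded
    } else {
        Action::AddLimit
    }
}

fn main() {
    // Modeled after CoalescePartitionsExec as described in this thread:
    // it merges partitions and has no with_fetch support yet (assumption).
    let coalesce = Node { combines_input_partitions: true, supports_with_fetch: false };
    // Modeled after SortPreservingMergeExec, which supports with_fetch.
    let merge = Node { combines_input_partitions: true, supports_with_fetch: true };

    println!("{:?}", decide(&coalesce, true));
    println!("{:?}", decide(&coalesce, false));
    println!("{:?}", decide(&merge, true));
}
```

The first call is the problematic path: the node merges partitions, cannot carry a fetch itself, and the state is already marked as satisfied, so no limit ends up above the merge.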

zhuqi-lucas · 2025-02-03T15:13:28Z

Thank you @ozankabak for the review. Yes, I believe adding a with_fetch API to CoalescePartitionsExec may also solve this issue, and I have also added it as a TODO in the PR code comments.

xudong963

Generally LGTM, I agree with @ozankabak 's suggestion.
Maybe you can also file an issue.

xudong963 · 2025-02-03T15:23:01Z

datafusion/physical-optimizer/src/limit_pushdown.rs

+        if limit_exec.input().as_any().is::<CoalescePartitionsExec>() {
+            // If the child is a `CoalescePartitionsExec`, we should not remove the limit
+            // the push_down through the `CoalescePartitionsExec` to each partition will not guarantee the limit.
+            // todo we may have a better solution if we can support with_fetch for limit inside CoalescePartitionsExec.


Thank you @xudong963 for the review. I changed the comments and added a follow-up issue:
#14446

zhuqi-lucas · 2025-02-03T15:42:21Z

Generally LGTM, I agree with @ozankabak 's suggestion. Maybe you can also file an issue.

Thank you @xudong963 for the review. I added the follow-up:
#14446

mertak-synnada · 2025-02-04T06:14:43Z

datafusion/physical-optimizer/src/limit_pushdown.rs

@@ -146,6 +146,15 @@ pub fn pushdown_limit_helper(
        global_state.skip = skip;
        global_state.fetch = fetch;

+        if limit_exec.input().as_any().is::<CoalescePartitionsExec>() {


While I agree with checking via API suggestion, please also check with the combines_input_partitions() helper function so that SortPreservingMerge can be affected as well.

In the optimizer logic, we remove the Limit operators first, and then we add them to the lowest possible point at the plan, if the plan is "satisfied" we drop the limit information. So if the plan is combining input partitions, we're only adding a global limit if skip information is there, maybe we can identify if the local limits are enough or not and then decide to add the global limit at there. But in the end, I think rather than adding a global limit, we should be able to limit in the CoalescePartitionsExec or in SortPreservingMerge so that it won't unnecessarily push more data

// Execution plans can't (yet) handle skip, so if we have one,
// we still need to add a global limit
if global_state.skip > 0 {
    new_plan =
        add_global_limit(new_plan, global_state.skip, global_state.fetch);
}
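
The difference between limiting after the merge and limiting inside the merging operator can be sketched as follows. This is a hypothetical model, not DataFusion code: `coalesce_with_fetch` and the `pulled` counter are made-up names, and partitions are plain vectors.

```rust
use std::cell::Cell;

/// Hypothetical merged stream: chains the partitions and stops after `fetch`
/// rows when a fetch is set. `pulled` counts rows read from the inputs.
fn coalesce_with_fetch(
    partitions: Vec<Vec<i32>>,
    fetch: Option<usize>,
    pulled: &Cell<usize>,
) -> Vec<i32> {
    let rows = partitions
        .into_iter()
        .flatten()
        .inspect(|_| pulled.set(pulled.get() + 1));
    match fetch {
        Some(n) => rows.take(n).collect(),
        None => rows.collect(),
    }
}

fn main() {
    let partitions = vec![vec![10, 11, 12], vec![20, 21]];

    // Limit applied by a separate step after the merge has finished.
    let pulled = Cell::new(0);
    let merged = coalesce_with_fetch(partitions.clone(), None, &pulled);
    let limited: Vec<i32> = merged.into_iter().take(1).collect();
    println!("limit after merge: {:?}, rows pulled: {}", limited, pulled.get());

    // Limit applied inside the merge itself.
    let pulled = Cell::new(0);
    let limited = coalesce_with_fetch(partitions, Some(1), &pulled);
    println!("fetch in merge: {:?}, rows pulled: {}", limited, pulled.get());
}
```

The eager `Vec` model exaggerates the difference; it is only meant to show where the rows are cut off, not how a streaming plan behaves in detail.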

Thank you @mertak-synnada for review:

While I agree with checking via API suggestion, please also check with the combines_input_partitions() helper function so that SortPreservingMerge can be affected as well.

I agree. I already checked SortPreservingMergeExec: it supports with_fetch() and fetch(), so I think it is not affected?

impl SortPreservingMergeExec {
    /// Create a new sort execution plan
    pub fn new(expr: LexOrdering, input: Arc<dyn ExecutionPlan>) -> Self {
        let cache = Self::compute_properties(&input, expr.clone());
        Self {
            input,
            expr,
            metrics: ExecutionPlanMetricsSet::new(),
            fetch: None,
            cache,
            enable_round_robin_repartition: true,
        }
    }

    /// Sets the number of rows to fetch
    pub fn with_fetch(mut self, fetch: Option<usize>) -> Self {
        self.fetch = fetch;
        self
    }

    /// Sets the selection strategy of tied winners of the loser tree algorithm
    ///
    /// If true (the default) equal output rows are placed in the merged stream
    /// in round robin fashion. This approach consumes input streams at more
    /// even rates when there are many rows with the same sort key.
    ///
    /// If false, equal output rows are always placed in the merged stream in
    /// the order of the inputs, resulting in potentially slower execution but a
    /// stable output order.
    pub fn with_round_robin_repartition(
        mut self,
        enable_round_robin_repartition: bool,
    ) -> Self {
        self.enable_round_robin_repartition = enable_round_robin_repartition;
        self
    }

    /// Input schema
    pub fn input(&self) -> &Arc<dyn ExecutionPlan> {
        &self.input
    }

    /// Sort expressions
    pub fn expr(&self) -> &LexOrdering {
        self.expr.as_ref()
    }

    /// Fetch
    pub fn fetch(&self) -> Option<usize> {
        self.fetch
    }

    /// Creates the cache object that stores the plan properties
    /// such as schema, equivalence properties, ordering, partitioning, etc.
    fn compute_properties(
        input: &Arc<dyn ExecutionPlan>,
        ordering: LexOrdering,
    ) -> PlanProperties {
        let mut eq_properties = input.equivalence_properties().clone();
        eq_properties.clear_per_partition_constants();
        eq_properties.add_new_orderings(vec![ordering]);
        PlanProperties::new(
            eq_properties,                        // Equivalence Properties
            Partitioning::UnknownPartitioning(1), // Output Partitioning
            input.pipeline_behavior(),            // Pipeline Behavior
            input.boundedness(),                  // Boundedness
        )
    }
}

But in the end, I think rather than adding a global limit, we should be able to limit in the CoalescePartitionsExec or in SortPreservingMerge so that it won't unnecessarily push more data.

I totally agree with this! So I created a follow-up, #14446, to support limit in CoalescePartitionsExec. SortPreservingMerge already supports this according to the code above.

So if the plan is combining input partitions, we're only adding a global limit if skip information is there, maybe we can identify if the local limits are enough or not and then decide to add the global limit at there.

This is a good point, we can create another issue to try to improve this!

Updated. I confirmed that SortPreservingMerge works well with fetch:

# Check output plan, expect no "output_ordering" clause in the physical_plan -> ParquetExec:
query TT
explain with selection as (
    select *
    from test_table
    ORDER BY string_col, int_col
    limit 1
)
select 1 as foo
from selection
order by string_col
limit 1000;
----
logical_plan
01)Projection: foo
02)--Sort: selection.string_col ASC NULLS LAST, fetch=1000
03)----Projection: Int64(1) AS foo, selection.string_col
04)------SubqueryAlias: selection
05)--------Projection: test_table.string_col
06)----------Sort: test_table.string_col ASC NULLS LAST, test_table.int_col ASC NULLS LAST, fetch=1
07)------------TableScan: test_table projection=[int_col, string_col]
physical_plan
01)ProjectionExec: expr=[foo@0 as foo]
02)--ProjectionExec: expr=[1 as foo, string_col@0 as string_col]
03)----ProjectionExec: expr=[string_col@1 as string_col]
04)------SortPreservingMergeExec: [string_col@1 ASC NULLS LAST, int_col@0 ASC NULLS LAST], fetch=1
05)--------SortExec: TopK(fetch=1), expr=[string_col@1 ASC NULLS LAST, int_col@0 ASC NULLS LAST], preserve_partitioning=[true]
06)----------ParquetExec: file_groups={2 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/0.parquet], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/1.parquet]]}, projection=[int_col, string_col]
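
Why a fetch on the merging operator is enough in the sorted case can be sketched with a small k-way merge. This is an illustrative model in plain Rust, not the SortPreservingMergeExec implementation: `sort_preserving_merge` is a hypothetical function, and each partition is assumed to be sorted ascending already (as the per-partition TopK in the plan above would guarantee).

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merges partitions that are each already sorted ascending and stops as soon
/// as `fetch` rows have been produced. A sketch only, not DataFusion code.
fn sort_preserving_merge(partitions: &[Vec<i32>], fetch: Option<usize>) -> Vec<i32> {
    // Min-heap of (value, partition index, row index) holding the next row of
    // every non-empty partition.
    let mut heap = BinaryHeap::new();
    for (p, rows) in partitions.iter().enumerate() {
        if let Some(&first) = rows.first() {
            heap.push(Reverse((first, p, 0usize)));
        }
    }

    let mut out = Vec::new();
    while let Some(Reverse((value, p, i))) = heap.pop() {
        // Stop once the requested number of rows has been emitted.
        if fetch.map_or(false, |n| out.len() >= n) {
            break;
        }
        out.push(value);
        // Refill the heap from the partition the row came from.
        if let Some(&next) = partitions[p].get(i + 1) {
            heap.push(Reverse((next, p, i + 1)));
        }
    }
    out
}

fn main() {
    let partitions = vec![vec![1, 4, 7], vec![2, 3, 9]];
    println!("{:?}", sort_preserving_merge(&partitions, Some(1)));
    println!("{:?}", sort_preserving_merge(&partitions, None));
}
```

Because the merge itself enforces the fetch, the merged output never exceeds the requested row count, no matter how many partitions feed it.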

mertak-synnada

Thanks!

berkaysynnada · 2025-02-04T10:40:23Z

Thank you @zhuqi-lucas, @mertak-synnada and @xudong963. This looks good to me now, and I'm merging it. I guess @mertak-synnada will open a follow-up PR removing the explicit casting by utilizing a state parameter.

adriangb · 2025-02-04T22:35:21Z

Thank you all for the quick fix!

zhuqi-lucas added 4 commits February 2, 2025 23:44

fix: Limits are not applied correctly

954df19

Add easy fix

952a858

Add fix

4bb1a46

Add slt testing

45cf612

github-actions bot added the optimizer (Optimizer rules) and sqllogictest (SQL Logic Tests (.slt)) labels Feb 3, 2025

zhuqi-lucas mentioned this pull request Feb 3, 2025

Limits are not applied correctly #14406

Closed

Merge remote-tracking branch 'upstream/main' into issues-14406

881fcce

Merge remote-tracking branch 'upstream/main' into issues-14406

b978115

This was referenced Feb 3, 2025

Fix limit application #14442

Closed

Bug hotfix pydantic/datafusion#4

Draft

xudong963 self-requested a review February 3, 2025 15:15

xudong963 approved these changes Feb 3, 2025

zhuqi-lucas mentioned this pull request Feb 3, 2025

Add CoalescePartitionsExec fetch (limit) support #14446

Closed

Address comments

a538138

mertak-synnada reviewed Feb 4, 2025

zhuqi-lucas requested a review from mertak-synnada February 4, 2025 07:15

mertak-synnada approved these changes Feb 4, 2025

berkaysynnada merged commit 0d9f845 into apache:main Feb 4, 2025
25 checks passed

alamb mentioned this pull request Feb 4, 2025

Feb 4, 2025: This week(s) in DataFusion #14491

Open
