feat: improve merge performance by using predicate non-partition columns min/max for prefiltering #2513
Conversation
ACTION NEEDED: delta-rs follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.
@JonasDev1 why did you make the advanced filtering optional? If this provides better performance across the board, we should enable it always (so also for the Python bindings).
My concern was that merges on columns containing nulls would not work with this, but I think they would not work without the advanced filtering either, as Spark has an extra null-safe operator (`<=>`) for that case.
What about the review? I can of course also remove the flag again.
My main issue is that it might work or not work depending on the contents of the data. I think that's a bit tricky, because a person needs to be aware of the contents of the data they are trying to write.
@JonasDev1 do you know if the issue was raised in DataFusion? I would love to see this feature really enhance large-table performance in cases where partitioning is not meaningful due to high cardinality, so it is very nice that you have initiated it.
@ion-elgreco I have tested the merge with null values in the merge predicate, both with and without the extended filtering. In both cases the merge doesn't work as expected and leads to duplicate rows. Therefore this pull request introduces no change there, and the behaviour is similar to Spark's. A null-safe operation would be a nice extension, but it is not tied to this pull request.
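To illustrate the null behaviour discussed above, here is a minimal sketch using DataFusion's SQL engine (which delta-rs uses to evaluate merge predicates). It only demonstrates SQL three-valued logic, not the delta-rs merge itself:

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // In SQL three-valued logic, `NULL = NULL` evaluates to NULL (not TRUE),
    // so a merge predicate like `source.id = target.id` never matches rows
    // whose key is NULL -- they fall through to "not matched" and are
    // re-inserted, which is where the duplicate rows come from.
    // `IS NOT DISTINCT FROM` is the SQL-standard null-safe equivalent of
    // Spark's `<=>` operator.
    let df = ctx
        .sql("SELECT NULL = NULL AS plain_eq, NULL IS NOT DISTINCT FROM NULL AS null_safe_eq")
        .await?;
    df.show().await?; // plain_eq: NULL, null_safe_eq: true
    Ok(())
}
```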
@JonasDev1 can you resolve the merge conflicts? Then I'll do another round of reviewing.
@ion-elgreco Done
Thanks, and sorry for the delay! @JonasDev1
@rtyler can a branch no longer be merged if it can't be rebased anymore? @JonasDev1 are you perhaps able to manually rebase and force-push it?
Description
This PR improves merge performance by adding min/max filters to the early filter.
The number of files scanned from the target table is reduced by using the table statistics.
I have extended the early filter for this purpose. This filter is responsible for pre-filtering the target table.
Previously, the early filter consisted only of partition columns, filtering on all unique values from the source. Now the non-partition columns are used as well, by aggregating the min/max values from the source and adding a BETWEEN expression to the early filter.
The extended filter also automatically becomes part of the predicate-based conflict detection.
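Conceptually, the extended filter can be assembled from the source's min/max values along these lines (illustrative names only; this is not the actual delta-rs code, and it assumes the per-column min/max have already been aggregated from the source):

```rust
use datafusion::prelude::*;
use datafusion::scalar::ScalarValue;

/// Append a `col BETWEEN min AND max` term for each non-partition join
/// column, on top of the existing partition-based early filter.
fn extend_early_filter(
    partition_filter: Expr,
    source_min_max: Vec<(&str, ScalarValue, ScalarValue)>,
) -> Expr {
    source_min_max
        .into_iter()
        .fold(partition_filter, |filter, (column, min, max)| {
            filter.and(col(column).between(lit(min), lit(max)))
        })
}
```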
I added a property `extended_early_filter` to make this advanced filtering optional. I don't know if this is important, and maybe we can replace the bool with an enum. What do you think about this?
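For illustration, one hypothetical shape such an enum could take (the names below are made up for this sketch and do not exist in delta-rs):

```rust
/// Hypothetical replacement for the `extended_early_filter` bool.
pub enum EarlyFilterMode {
    /// Previous behaviour: partition columns only.
    PartitionsOnly,
    /// Behaviour added by this PR: partition columns plus min/max
    /// BETWEEN terms for non-partition predicate columns.
    PartitionsAndMinMax,
}
```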
Example: merge into table `t` with partition column `date`
Predicate: source.date = target.date and source.timestamp = target.timestamp and source.id = target.id and frob > 42
Early filter before:
date = '2024-05-14' and frob > 42
Early filter now:
date = '2024-05-14' and timestamp BETWEEN '…15:00' AND '…15:05' and id BETWEEN 'A' AND 'B' and frob > 42
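For context, a sketch of the kind of merge that benefits from this change. The `MergeBuilder` calls are the existing delta-rs Rust API; `with_extended_early_filter` is hypothetical, named after the property this PR introduces, and may change (or disappear entirely if the filtering becomes always-on):

```rust
use deltalake::datafusion::prelude::{col, DataFrame};
use deltalake::{DeltaOps, DeltaResult, DeltaTable};

async fn merge_events(table: DeltaTable, source: DataFrame) -> DeltaResult<()> {
    let (_table, _metrics) = DeltaOps(table)
        .merge(
            source,
            "source.date = target.date \
             AND source.timestamp = target.timestamp \
             AND source.id = target.id",
        )
        .with_source_alias("source")
        .with_target_alias("target")
        // .with_extended_early_filter(true) // hypothetical flag from this PR
        .when_matched_update(|update| update.update("frob", col("source.frob")))?
        .when_not_matched_insert(|insert| {
            insert
                .set("date", col("source.date"))
                .set("timestamp", col("source.timestamp"))
                .set("id", col("source.id"))
                .set("frob", col("source.frob"))
        })?
        .await?;
    Ok(())
}
```

With the extended early filter, the target scan for such a merge is pruned not only by the `date` partition value but also by the `timestamp` and `id` BETWEEN ranges shown above.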