Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: Improve median with no grouping by 2X #14399

Merged
merged 2 commits into from
Feb 2, 2025

Conversation

2010YOUY01
Copy link
Contributor

Which issue does this PR close?

This is a small (one 20-lines change) follow-on to #13681

Rationale for this change

@Rachelint (in #13681) makes median() aggregate function significantly faster by implementing GroupsAccumulator
However, regular Accumulator is still used in no-grouping aggregates and window expressions, this PR took some code from #13681 to also improve median()'s regular Accumulator implementation

Benchmarks:

DataFusion CLI v44.0.0
> CREATE EXTERNAL TABLE IF NOT EXISTS lineitem (
        l_orderkey BIGINT,
        l_partkey BIGINT,
        l_suppkey BIGINT,
        l_linenumber INTEGER,
        l_quantity DECIMAL(15, 2),
        l_extendedprice DECIMAL(15, 2),
        l_discount DECIMAL(15, 2),
        l_tax DECIMAL(15, 2),
        l_returnflag VARCHAR,
        l_linestatus VARCHAR,
        l_shipdate DATE,
        l_commitdate DATE,
        l_receiptdate DATE,
        l_shipinstruct VARCHAR,
        l_shipmode VARCHAR,
        l_comment VARCHAR,
) STORED AS parquet
LOCATION '/Users/yongting/Code/datafusion/benchmarks/data/tpch_sf10/lineitem';
0 row(s) fetched.
Elapsed 0.011 seconds.

> select median(l_orderkey) from lineitem;
+-----------------------------+
| median(lineitem.l_orderkey) |
+-----------------------------+
| 29993285                    |
+-----------------------------+
1 row(s) fetched.
Elapsed 0.529 seconds.

PR: 0.5s
main: 1s

What changes are included in this PR?

Changed state() implementation in median's Accumulator

Are these changes tested?

existing tests

Are there any user-facing changes?

No


// Build inner array
let values_array =
PrimitiveArray::<T>::new(ScalarBuffer::from(self.all_values.clone()), None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems we can take all_values directly, and reduce the allocation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated, thanks!

Copy link
Contributor

@Rachelint Rachelint left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @2010YOUY01 , it is really good.

@alamb alamb changed the title perf: 2X faster no grouping median() function perf: Improve median with no grouping by 2X Feb 1, 2025
@Rachelint Rachelint merged commit 1e0531f into apache:main Feb 2, 2025
26 checks passed
cj-zhukov pushed a commit to cj-zhukov/datafusion that referenced this pull request Feb 3, 2025
* improve median() no grouping case

* review
@alamb
Copy link
Contributor

alamb commented Feb 3, 2025

🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants