[EPIC] Improved performance in H20.ai benchmarks #13548
Comments
For posterity, here is a link to the discord chat: https://discord.com/channels/885562378132000778/1309883046886903870/1309887744595595324
I'd like to note that DataFusion's performance really starts to lag as the dataset size grows. Take a look at this query: When the dataset is 10 million rows, Polars takes 3 seconds and DataFusion takes 3.6 seconds, so pretty similar. When the dataset is 100 million rows, Polars takes 126 seconds and DataFusion takes 2,100 seconds.
What version are you working with? @Rachelint has some ideas of how to improve this:
Hm this seems something quadratic in nature?
Does it fully explain the dramatic difference? @MrPowers how do you generate the 10M vs 100M rows?
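As a rough check on the "quadratic" guess, the timings quoted above can be turned into an empirical scaling exponent k, assuming run time grows like rows^k between the two measurements. This is only a back-of-the-envelope sketch; `scaling_exponent` is a hypothetical helper, not part of DataFusion or any benchmark tool.

```rust
/// Estimate k in `time ~ rows^k` from two (rows, seconds) measurements.
/// Hypothetical helper for illustration only.
fn scaling_exponent(rows_small: f64, secs_small: f64, rows_large: f64, secs_large: f64) -> f64 {
    // k = ln(time ratio) / ln(size ratio)
    (secs_large / secs_small).ln() / (rows_large / rows_small).ln()
}
```

With the numbers reported in this thread, `scaling_exponent(1e7, 3.0, 1e8, 126.0)` gives roughly 1.6 for Polars and `scaling_exponent(1e7, 3.6, 1e8, 2100.0)` gives roughly 2.8 for DataFusion, so the DataFusion slowdown is steeper than a purely quadratic algorithm would predict from these two data points.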
I would also expect this to help (but it was merged and depends on when it is merged)
@Dandandan - thanks to the great work by @SemyonSinchenko, it's easy to generate these datasets with falsa. Here's the command to generate the 10 million row dataset:
Thanks, I will profile it and see what is causing the long run time in DataFusion.
🤔 I guess it may be caused by something similar to what we encountered during benchmarking in #11827
Specifically that |
I am not sure, but I think it may really be related to
I reran the H2O Q9 with
I didn't reproduce the drastic slowdown in
Looking at the benchmark results, I think query 8 is worth analyzing / optimizing as well:
This explains 👍🏼 I ran the benchmark on a MacBook with 48 GB of RAM. We should also take a look at how much memory DataFusion consumes for those queries compared to other systems. Thanks for the report.
Yes, I ran it today, and my machine has only 16 GB of memory too... and I found the query very slow due to swapping as well...
I think making DataFusion work better in lower memory situations would certainly be nice
Update here is that @2010YOUY01 has made I also dug up and connected the task to add the H2O.ai queries to bench.sh. Check out
Is your feature request related to a problem or challenge?
The basic aggregate functions like COUNT and SUM in DataFusion are very fast (see Apache DataFusion is now the fastest single node engine for querying Apache Parquet files).
However, many of the other aggregate functions are not particularly fast, and this shows up specifically on some of the H2O benchmarks.
We saw this in the results in the 2024 DataFusion SIGMOD paper
(BTW we have made median faster)
@MrPowers has also observed similar results on discord (link):
See his version of the benchmarks here
https://github.com/MrPowers/mrpowers-benchmarks
Testing
dfbench #7209
Functions
corr function #13549
median function #13550
Other improvements
Describe the solution you'd like
DataFusion has two APIs for implementing aggregate functions like SUM and COUNT:
Accumulator (api docs)
GroupsAccumulator (api docs)
The basic aggregates are implemented using GroupsAccumulator and are part of DataFusion's performance.
This ticket tracks the effort to improve the performance of these "more advanced" aggregate functions, likely by implementing GroupsAccumulator
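To illustrate why a groups-oriented API tends to be faster, here is a minimal sketch contrasting the two styles for a grouped SUM. Everything in it (`SimpleAccumulator`, `SumAcc`, `SumGroups`, `sum_per_group_boxed`) is a hypothetical, simplified stand-in written for this example; it is loosely modelled on the idea behind Accumulator and GroupsAccumulator and does not reproduce DataFusion's real trait signatures.

```rust
use std::collections::HashMap;

// Hypothetical, simplified stand-ins; NOT DataFusion's real traits.

/// Per-group style: each group owns a separate state object.
trait SimpleAccumulator {
    fn update(&mut self, value: f64);
    fn evaluate(&self) -> f64;
}

#[derive(Default)]
struct SumAcc {
    sum: f64,
}

impl SimpleAccumulator for SumAcc {
    fn update(&mut self, value: f64) {
        self.sum += value;
    }
    fn evaluate(&self) -> f64 {
        self.sum
    }
}

/// One boxed accumulator per group: a heap allocation and a dynamic
/// dispatch per group, plus a hash lookup per row.
fn sum_per_group_boxed(group_ids: &[usize], values: &[f64]) -> HashMap<usize, f64> {
    let mut accs: HashMap<usize, Box<dyn SimpleAccumulator>> = HashMap::new();
    for (&g, &v) in group_ids.iter().zip(values) {
        accs.entry(g)
            .or_insert_with(|| Box::new(SumAcc::default()) as Box<dyn SimpleAccumulator>)
            .update(v);
    }
    accs.into_iter().map(|(g, a)| (g, a.evaluate())).collect()
}

/// Groups style: a single object stores the state of all groups in one
/// flat Vec, indexed by a dense group index.
#[derive(Default)]
struct SumGroups {
    sums: Vec<f64>,
}

impl SumGroups {
    fn update_batch(&mut self, values: &[f64], group_indices: &[usize], total_num_groups: usize) {
        // Grow the state so every group index seen so far has a slot.
        if total_num_groups > self.sums.len() {
            self.sums.resize(total_num_groups, 0.0);
        }
        for (&g, &v) in group_indices.iter().zip(values) {
            self.sums[g] += v;
        }
    }

    fn evaluate(self) -> Vec<f64> {
        self.sums
    }
}
```

In the groups style the inner loop is a plain indexed add over contiguous memory, which is the kind of code that stays fast when there are many distinct groups; the per-group style pays for an allocation and a virtual call for each group.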
Describe alternatives you've considered
For each function listed above, ideally we would:
Implement GroupsAccumulator for the relevant aggregate function in a second PR (along with tests for correctness). We would use the benchmark to verify the performance.
Here is a pretty good example of how @eejbyfeldt did this for STDDEV:
Additional context
No response