Add Datafusion solution [updated] #240
base: master
@Dandandan fyi took a first stab at group by q8.
results currently similar to spark
Nice! The spark solution has
@Dandandan FYI I migrated to the Python bindings, which should make integrating with their flow easier since I'm using the existing Python helpers. I still have to migrate the join suite. Let me know if you have any thoughts. Results below; something odd may be going on with Q10.
ans = ctx.sql("SELECT id1, SUM(v1) AS v1 FROM x GROUP BY id1").collect()
t = timeit.default_timer() - t_start
print(t)
shape = ans_shape(ans)
For every solution in this benchmark, checking the shape is part of the timing, to ensure that no lazy evaluation is left outside the measurement. I can imagine DataFusion is not lazy, yet it seems unfair to skip this step in the timing.
Makes sense. I'll update!
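For reference, a minimal sketch of the change requested above, where the shape check moves inside the timed region. The `timed` helper and the `LazyResult` stand-in are hypothetical names used only for illustration; in the PR the query would be the `ctx.sql(...).collect()` call and the shape function would be the existing `ans_shape` helper.

```python
import timeit


def timed(run_query, get_shape):
    """Hypothetical helper: time the query and the shape check together."""
    t_start = timeit.default_timer()
    ans = run_query()
    # The shape check runs before the clock stops, so any deferred
    # work it triggers is counted in the elapsed time.
    shape = get_shape(ans)
    elapsed = timeit.default_timer() - t_start
    return ans, shape, elapsed


class LazyResult:
    """Stand-in for a lazy engine (assumption, not DataFusion itself):
    nothing is computed until the shape is requested."""

    def __init__(self, rows):
        self._rows = rows
        self.materialized = False

    def shape(self):
        self.materialized = True
        data = list(self._rows)  # the deferred work happens here
        return (len(data), len(data[0]) if data else 0)


# A generator of (id1, v1) pairs plays the role of the pending query.
rows = ((i % 3, i) for i in range(9))
ans, shape, elapsed = timed(lambda: LazyResult(rows), lambda r: r.shape())
print(shape, elapsed)
```

With an eager engine the extra call costs little, and with a lazy one it keeps the measurement comparable to the other solutions in the benchmark.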
@jangorecki I've made a number of updates, including adding DataFusion to some of your utilities / runners, which will hopefully make your life easier. Would you be able to see how close this is? One thing I haven't been able to test locally is running against the larger datasets, so I'm not sure if / what errors we may get on those. Do you have a recommendation for how to handle that? Thanks for your help!
Hi @jangorecki, just checking in on this and whether there is anything I can do to help. As some additional context, DataFusion has / will soon have several new features that will improve our query coverage and likely performance. From your perspective, would you rather we submit once those are all completed, or can we get the current submission merged as is and iterate from there? Thanks!
I am no longer a maintainer of this project, as I don't work for H2O anymore. I would start by contacting the maintainer of the project to ensure that the effort you are going to undertake will be merged in. H2O support is very helpful, so you should not have problems finding out who now takes care of the project. Aside from the support channel, you should also easily reach H2O on Twitter etc.
@jangorecki thank you for your work on this and for letting us know :) I will reach out to H2O for support.
Updated PR to get DataFusion added to benchmarks.
Right now it is missing group by queries 6, 8, and 9. I am going to look into those missing queries and then start looking into the flow / required output.
Let me know if anything in particular would make your life easier to add this :)
One question: can someone confirm that this can be run with cargo? Similar to the work @Dandandan did (I picked up from there), I am running the queries with the commands below: