[DISCUSSION] Lowering the barrier to new users (Lessons from 15-799 CMU Optimizer Class) #14373
Comments
Here is my email I sent to @lmwnshn about potential projects:

All of these projects would be written in Rust, on a production grade open source query engine (DataFusion). Reference: https://dl.acm.org/doi/10.1145/3626246.3653368

There is significant community interest in the features too, so if done well I think it is likely there would be community interaction and the code would be accepted.

Implement Sideways Information Passing / Dynamic Filter Pushdown in DataFusion
Ticket: #7955
This project is well documented, but only partly optimizer related.
Students would learn:

Implement LATERAL JOINs in DataFusion
Ticket: #10048
This one is less well specified, but if a group wants to work on this I can find time to help specify it more.
Students would learn:

Implement Range Joins / ASOF joins
Ticket: #318
This one has had some work and even a prototype initial implementation. However, it needs help to design / explain / evaluate the existing approach.
Students would learn:
On Please see |
I will think about some optimizer-focused projects and circle back next week. @alamb, IMHO the tickets you mention are partly optimizer related, but probably more so about the internals of the operators involved. I think we can find more optimizer-focused projects. There is so much to do there 🚀
That would be awesome -- I am sure @lmwnshn would appreciate any other suggestions
I think aggregate pushdown is a potential optimizer-related project #8699
TLDR
Background
CMU's DB group is running 15-799 Special Topics in Databases: Query Optimization this Spring 2025.
I would very much like to encourage a virtuous cycle of research/academic work on DataFusion --> contributing back and thus want to make DataFusion a compelling target for this type of class. I think the more use of DataFusion the more contributions we will attract and thus the better it will get for everyone.
Andy Pavlo connected me with @lmwnshn, who is helping organize the class, and provided the following feedback
Class Projects
@lmwnshn noted that the class includes a project (see Syllabus here):
They were interested in working on some projects in DataFusion. Here were some ideas I had that I think might be good but would love to hear what other people think is important. This is also a great opportunity for others to get exposed to advanced techniques and learn how to work on large open source projects like DataFusion (obviously also how great our community is):
Ask: Does anyone else know of other interesting potential projects we can suggest??
Improving the Ease of Student Onboarding
@lmwnshn mentioned a few reasons he chose DuckDB over DataFusion for the first project (link), all related to easier student onboarding:
"Speed to running TPCH"
TPCH is an important benchmark for analytic systems. The ease of getting to the point of running TPCH is important
The DuckDB Experience:
INSTALL tpch; LOAD tpch; CALL dbgen(sf = 1);
The DataFusion Experience:
Ideas to improve the DataFusion experience:
- dbgen function to datafusion-cli: Make it easier to run TPCH queries with datafusion-cli #14608
- datafusion-cli (in addition to homebrew)
Easier / Better Optimizer Configuration
DuckDB has a dedicated optimizer section, easy to selectively disable optimization, see the duckdb_optimizers() function
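To make "selectively disable optimization" concrete, here is a minimal sketch of a rule pipeline with per-rule on/off switches. Everything in it is hypothetical and for illustration only: the `Optimizer` class, the two toy rules, and the list-of-operator-names "plan" are my own assumptions, not DuckDB's or DataFusion's actual API or plan representation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

# Toy plan: operator names listed from the root down to the scan.
# This is a stand-in for illustration, not a real engine's plan type.
Plan = List[str]


def remove_redundant_sorts(plan: Plan) -> Plan:
    """Drop a Sort that sits directly on top of another Sort."""
    out: Plan = []
    for op in plan:
        if op == "Sort" and out and out[-1] == "Sort":
            continue
        out.append(op)
    return out


def push_filter_below_projection(plan: Plan) -> Plan:
    """Swap each Filter with the Projection directly beneath it."""
    out = list(plan)
    for i in range(len(out) - 1):
        if out[i] == "Filter" and out[i + 1] == "Projection":
            out[i], out[i + 1] = out[i + 1], out[i]
    return out


@dataclass
class Optimizer:
    # Rule name -> rule function, applied in insertion order.
    rules: Dict[str, Callable[[Plan], Plan]]
    disabled: Set[str] = field(default_factory=set)

    def list_rules(self) -> List[Tuple[str, bool]]:
        """Return (rule name, enabled?) pairs, one per registered rule."""
        return [(name, name not in self.disabled) for name in self.rules]

    def disable(self, name: str) -> None:
        if name not in self.rules:
            raise KeyError(f"unknown optimizer rule: {name}")
        self.disabled.add(name)

    def optimize(self, plan: Plan) -> Plan:
        for name, rule in self.rules.items():
            if name not in self.disabled:
                plan = rule(plan)
        return plan


def default_optimizer() -> Optimizer:
    return Optimizer(rules={
        "remove_redundant_sorts": remove_redundant_sorts,
        "push_filter_below_projection": push_filter_below_projection,
    })
```

The point for a class setting is that a student can list the rules, switch one off by name, and compare the resulting plans without touching the optimizer's source.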
DataFusion has many datafusion.optimizer prefixed settings (https://datafusion.apache.org/user-guide/configs.html), but it is not clear how these can enable/disable optimizer passes.
Ideas to improve the DataFusion experience:
- EnforceDistribution and optimizations like PushDownProjection). I will file a ticket / find one
Better explain plan
EXPLAIN visualization. I think DuckDB's is easier for students to look at.
Compare: DuckDB
DataFusion
We have discussed this previously:
I also left a note with a suggested way to get started implementing this: #9371 (comment)
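For anyone wondering what a tree-style rendering involves, here is a small sketch that prints a nested plan with branch connectors. It is only an illustration under my own assumptions: `PlanNode` and `render_tree` are hypothetical names, the operator labels are made up, and this is not the approach from the linked comment nor DataFusion's or DuckDB's actual EXPLAIN code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlanNode:
    # Hypothetical minimal plan node: a label plus child operators.
    name: str
    children: List["PlanNode"] = field(default_factory=list)


def render_tree(node: PlanNode, prefix: str = "",
                is_last: bool = True, is_root: bool = True) -> List[str]:
    """Return the plan as a list of text lines, one per operator."""
    lines: List[str] = []
    if is_root:
        lines.append(node.name)
        child_prefix = ""
    else:
        connector = "└── " if is_last else "├── "
        lines.append(prefix + connector + node.name)
        # Keep a vertical bar running while siblings remain below.
        child_prefix = prefix + ("    " if is_last else "│   ")
    for i, child in enumerate(node.children):
        last_child = i == len(node.children) - 1
        lines.extend(render_tree(child, child_prefix, last_child, False))
    return lines


if __name__ == "__main__":
    plan = PlanNode("Projection", [
        PlanNode("HashJoin", [
            PlanNode("Scan: lineitem"),
            PlanNode("Scan: orders"),
        ]),
    ])
    print("\n".join(render_tree(plan)))
```

Join inputs show up as siblings under one parent, which is the part students tend to find hard to see in a purely indented listing.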
enable_ident_normalization doesn't work / paper cut
There appears to (still) be some non-trivial confusion about identifier normalization
For example, #9399 (comment)
I have corrected the issue #9399 (reply in thread) but clearly there is more confusion
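Since the confusion is usually about which identifiers get changed, here is a tiny sketch of the rule as I understand it: unquoted identifiers are folded to lowercase when normalization is enabled, while double-quoted identifiers keep their case. The `normalize_ident` function and its `enable_normalization` parameter are hypothetical stand-ins for illustration, not DataFusion's implementation.

```python
def normalize_ident(ident: str, enable_normalization: bool = True) -> str:
    """Return the name an identifier resolves to under the sketched rule.

    Double-quoted identifiers keep their exact case (quotes stripped,
    doubled quotes unescaped). Unquoted identifiers are lowercased only
    when normalization is enabled.
    """
    if len(ident) >= 2 and ident.startswith('"') and ident.endswith('"'):
        return ident[1:-1].replace('""', '"')
    return ident.lower() if enable_normalization else ident


if __name__ == "__main__":
    for raw in ["MyCol", '"MyCol"', "mycol"]:
        print(raw, "->", normalize_ident(raw))
```

Under this rule a column created as MyCol without quotes and later referenced as "MyCol" with quotes will not match, which is the typical source of the paper cut.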
Ideas to improve DataFusion:
datafusion-python
Substrait
@lmwnshn reported that for their first project(link), they tried running substrait plans created by Apache Calcite using DataFusion. The idea was to
The good news is that DataFusion works well compared to the other system they tried, DuckDB. DuckDB could run only 1 query and DataFusion could run 15 of 21 queries (:clappy: for @vbarua, @BlizzaraB and @Lordworms!). However, 6 queries still failed so they reverted to a SQL based approach using DuckDB (see above)
He also pointed out that the official substrait consumer-testing repo simply checks that Isthmus -> DataFusion throws an exception:
Learnings
Apparently DuckDB can read Substrait JSON; this was easier for students to understand than the protobuf / BLOB format.
I think DataFusion can read this format too (thanks @Lordworms!)
datafusion/datafusion/substrait/tests/cases/consumer_integration.rs
Line 35 in 0edc3d9
However, it is not super clear how to run such plans with datafusion-cli
Ideas to improve DataFusion:
- datafusion-cli (follow the duckdb syntax)
- datafusion-cli allowed running substrait plans via the command line (perhaps a table function?)
- substrait-consumer repo to ensure / show DataFusion substrait working