[DISCUSSION] Lowering the barrier to new users (Lessons from 15-799 CMU Optimizer Class) #14373

alamb · 2025-01-30T16:39:42Z

TLDR

This is feedback from CMU on how to make DataFusion more compelling for Academic work.
Does anyone have suggestions about additional DataFusion projects to suggest (see below)?

Background

CMU's DB group is running CMU-799 Special Topics in Databases: Query Optimization in Spring 2025.

This course ... covers the classical and state-of-the-art methods and algorithms for converting SQL statements into physical query plans. Additional topics include cost models, feedback mechanisms, and adaptive query optimization. All class projects will be in the context of an open-source query optimizer service using real-world queries. The course is appropriate for graduate students in software systems and advanced undergraduates with nasty programming skills that are pursuing a database-centric lifestyle.

I would very much like to encourage a virtuous cycle of research/academic work on DataFusion --> contributing back and thus want to make DataFusion a compelling target for this type of class. I think the more use of DataFusion the more contributions we will attract and thus the better it will get for everyone.

Andy Pavlo connected me with @lmwnshn, who is helping organize the class and who provided the following feedback.

Class Projects

@lmwnshn noted that the class includes a project (see Syllabus here):

The main component of this course will be the group programming project. Students will organize into groups and to implement a large system / prototype. The projects are designed to be (1) relevant to the materials discussed in class and (2) require a significant programming effort from all team members.

Release Date: Feb 17, 2025
Due Date: May 01, 2025

They were interested in working on some projects in DataFusion. Here were some ideas I had that I think might be good but would love to hear what other people think is important. This is also a great opportunity for others to get exposed to advanced techniques and learn how to work on large open source projects like DataFusion (obviously also how great our community is):

Dynamic Filter Pushdown: Push Dynamic Join Predicates into Scan ("Sideways Information Passing", etc) #7955
LATERAL JOINs: Feature request: Support for lateral joins #10048
Range Joins / ASOF joins: ASOF join support / Specialize Range Joins #318

Ask: Does anyone else know of other interesting potential projects we can suggest??

Improving the Ease of Student Onboarding

@lmwnshn mentioned a few reasons he chose DuckDB over DataFusion for the first project (link), all related to easier student onboarding:

"Speed to running TPCH"

TPCH is an important benchmark for analytic systems. The ease of getting to the point of running TPCH is important

The DuckDB Experience:

Download the duckdb binary
Run commands INSTALL tpch; LOAD tpch; CALL dbgen(sf = 1);

The DataFusion Experience:

install Rust,
install datafusion-cli,
give them background knowledge about TPC-H
point them to bench.sh (~750 LOC)

Ideas to improve the DataFusion experience:

Add built in dbgen function to datafusion-cli: Make it easier to run TPCH queries with datafusion-cli #14608
Add more Pre-built binaries for datafusion-cli (in addition to homebrew)
Improve getting started instructions to make the process easier / more copy/paste

Easier / Better Optimizer Configuration

DuckDB has a dedicated optimizer section and makes it easy to selectively disable optimizations; see the duckdb_optimizers() function.

DataFusion has many settings prefixed with datafusion.optimizer (https://datafusion.apache.org/user-guide/configs.html), but it is not clear how these can enable/disable individual optimizer passes.

Ideas to improve the DataFusion experience:

Make it easier to disable just the optimizers from datafusion-cli (I think this would require making a better distinction between passes required to run like `EnforceDistribution` and optimizations like `PushDownProjection`). I will file a ticket / find one
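
To make the "required vs. optional" distinction concrete, here is a minimal sketch in plain Python. It is not DataFusion code: `Pass`, `tag`, `PASSES` and `run_pipeline` are hypothetical names, and treating `EnforceDistribution` as required and `PushDownProjection` as optional simply follows the two examples above.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

Plan = List[str]  # toy plan: just a log of what has been applied


@dataclass(frozen=True)
class Pass:
    name: str
    required: bool  # True if the plan cannot execute without this pass
    apply: Callable[[Plan], Plan]


def tag(name: str) -> Callable[[Plan], Plan]:
    """Stand-in for a real rewrite: record that the pass ran."""
    return lambda plan: plan + [f"applied:{name}"]


# Hypothetical classification, following the two examples in the text above.
PASSES = [
    Pass("PushDownProjection", required=False, apply=tag("PushDownProjection")),
    Pass("EnforceDistribution", required=True, apply=tag("EnforceDistribution")),
]


def run_pipeline(plan: Plan, passes: Iterable[Pass], optimize: bool = True) -> Plan:
    """Always run required passes; run optional ones only when optimize=True."""
    for p in passes:
        if p.required or optimize:
            plan = p.apply(plan)
    return plan


# With optimizations off, only the required pass runs.
print(run_pipeline(["TableScan"], PASSES, optimize=False))
```

With a split like this, a single switch could turn off every optimization while still producing a runnable plan, which is roughly what a student comparing optimizer passes would want.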

Better explain plan

EXPLAIN visualization. I think DuckDB's is easier for students to look at.

Compare: DuckDB


┌─────────────────────────────┐
│┌───────────────────────────┐│
││       Physical Plan       ││
│└───────────────────────────┘│
└─────────────────────────────┘
...
┌─────────────┴─────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             k             │
│      length(Referer)      │
│          Referer          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│           FILTER          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│      (Referer != '')      │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│        EC: 99997497       │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│       PARQUET_SCAN        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│          Referer          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│        EC: 99997497       │
└───────────────────────────┘

DataFusion

+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Limit: skip=0, fetch=25                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
|               |   Sort: l DESC NULLS FIRST, fetch=25                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
|               |     Projection: regexp_replace(hits.parquet.Referer,Utf8("^https?://(?:www\.)?([^/]+)/.*$"),Utf8("\1")) AS k, AVG(character_length(hits.parquet.Referer)) AS l, COUNT(*) AS c, MIN(hits.parquet.Referer)                                                                                                                                                                                                                                                                                                                                              |
|               |       Filter: COUNT(*) > Int64(100000)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
|               |         Aggregate: groupBy=[[regexp_replace(hits.parquet.Referer, Utf8("^https?://(?:www\.)?([^/]+)/.*$"), Utf8("\1"))]], aggr=[[AVG(CAST(character_length(hits.parquet.Referer) AS Float64)), COUNT(UInt8(1)) AS COUNT(*), MIN(hits.parquet.Referer)]]                                                                                                                                                                                                                                                                                               |
|               |           Filter: hits.parquet.Referer != Utf8("")                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
|               |             TableScan: hits.parquet projection=[Referer], partial_filters=[hits.parquet.Referer != Utf8("")]                                                                                                                                                                                                                                                                                                                                                                                                                                          |
...

We have discussed this previously:

Implement Nicer / DuckDB style explain plans #9371

I also left a note with a suggested way to get started implementing this: #9371 (comment)
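
As a rough illustration of the idea (not the approach suggested in #9371, and not DataFusion code), the indented text plan can be turned into boxes with a few lines of Python. The sketch assumes two spaces of indentation per level and an `Operator: details` line shape, as in the logical plan above; `parse_indented_plan` and `render_boxes` are hypothetical names.

```python
from typing import List, Tuple

Node = Tuple[int, str, str]  # (depth, operator, details)


def parse_indented_plan(text: str, indent: int = 2) -> List[Node]:
    """Split each plan line into depth (from indentation), operator, details."""
    nodes = []
    for line in text.splitlines():
        if not line.strip():
            continue
        body = line.lstrip(" ")
        depth = (len(line) - len(body)) // indent
        op, _, details = body.partition(":")
        nodes.append((depth, op.strip(), details.strip()))
    return nodes


def _row(text: str, inner: int, pad: str) -> str:
    """One centered box row; long text is truncated with '...'."""
    if len(text) > inner:
        text = text[: inner - 3] + "..."
    return f"{pad}│{text.center(inner)}│"


def render_boxes(nodes: List[Node], width: int = 29) -> str:
    """Draw one box per operator, shifted right by its depth."""
    inner = width - 2
    lines = []
    for depth, op, details in nodes:
        pad = " " * (2 * depth)
        lines.append(f"{pad}┌{'─' * inner}┐")
        lines.append(_row(op.upper(), inner, pad))
        if details:
            lines.append(_row(details, inner, pad))
        lines.append(f"{pad}└{'─' * inner}┘")
    return "\n".join(lines)


PLAN = """\
Limit: skip=0, fetch=25
  Sort: l DESC NULLS FIRST, fetch=25
    Filter: COUNT(*) > Int64(100000)
"""
print(render_boxes(parse_indented_plan(PLAN)))
```

Truncating long expressions is a deliberate choice here: the very long `regexp_replace(...)` projections are a large part of what makes the current output hard to scan.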

`enable_ident_normalization` doesn't work / paper cut

There appears to (still) be some non-trivial confusion about identifier normalization

For example, #9399 (comment)

I have corrected the issue #9399 (reply in thread) but clearly there is more confusion

Ideas to improve DataFusion:

Consolidate examples
Improve documentation
Examples in datafusion-python
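
For readers unfamiliar with the flag, here is a small model of the rule as I understand it: with normalization on, unquoted identifiers are folded to lowercase, and quoted identifiers are always kept verbatim. This is a plain-Python sketch with hypothetical helper names (`normalize_ident`, `resolve`), not DataFusion's parser, and it only models the SQL side; whether other frontends such as Substrait consult the flag is not established here.

```python
def normalize_ident(ident: str, enable_normalization: bool = True) -> str:
    """Quoted names are kept verbatim; unquoted names are lowercased
    only when normalization is enabled."""
    if len(ident) >= 2 and ident.startswith('"') and ident.endswith('"'):
        return ident[1:-1].replace('""', '"')
    return ident.lower() if enable_normalization else ident


def resolve(ident: str, catalog: set, enable_normalization: bool = True) -> str:
    """Case-sensitive catalog lookup after (optional) normalization."""
    name = normalize_ident(ident, enable_normalization)
    if name not in catalog:
        raise LookupError(f"No table named '{name}'")
    return name


catalog = {"lineitem"}  # table registered under a lowercase name

print(resolve("LINEITEM", catalog))  # folded to lowercase, so it resolves
try:
    # Normalization off: the name is looked up exactly as written.
    resolve("LINEITEM", catalog, enable_normalization=False)
except LookupError as err:
    print(err)
```

Under this model, turning normalization off makes lookups stricter rather than more forgiving, which may be part of why the flag surprises people.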

Substrait

@lmwnshn reported that for their first project (link), they tried running Substrait plans created by Apache Calcite using DataFusion. The idea was to:

Use Calcite as frontend SQL --> substrait
Take the optimized queries --> run on the engine
Turn on / off optimizer passes

The good news is that DataFusion works well compared to the other system they tried, DuckDB. DuckDB could run only 1 query and DataFusion could run 15 of 21 queries (:clappy: for @vbarua, @BlizzaraB and @Lordworms!). However, 6 queries still failed so they reverted to a SQL based approach using DuckDB (see above)

He also pointed out that the official Substrait consumer-testing repo simply checks that Isthmus -> DataFusion throws an exception:

https://github.com/substrait-io/consumer-testing/tree/main/substrait_consumer/snapshots/results/integration/tpch

Learnings

Apparently DuckDB can read Substrait JSON; this was easier for students to understand than the protobuf / BLOB format.

I think DataFusion can read this format too (thanks @Lordworms!)

For example: https://github.com/apache/datafusion/blob/0edc3d99ad63399696135d3bb3fc387b38803d1f/datafusion/substrait/tests/testdata/tpch_substrait_plans/query_01_plan.json, which is read by tpch_plan_to_string in datafusion/datafusion/substrait/tests/cases/consumer_integration.rs (line 35 in 0edc3d9):

async fn tpch_plan_to_string(query_id: i32) -> Result<String> {

However, it is not super clear how to run such plans with datafusion-cli

Ideas to improve DataFusion:

Consider adding some way to use substrait plans via datafusion-cli (follow the duckdb syntax)
It might make this more apparent if datafusion-cli allowed running substrait plans via the command line (perhaps a table function?)
Implement / update the substrait-consumer repo to ensure / show DataFusion substrait working
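
One thing the JSON form makes easy is inspecting a plan before running it, for example to see which table names (and in which case) the plan expects to be registered. The sketch below uses only Python's standard `json` module. The sample plan is hand-written and heavily trimmed, not real Isthmus output, and the `namedTable` / `names` keys are my assumption about how a Substrait named-table read appears in protobuf JSON; `named_tables` is a hypothetical helper.

```python
import json
from typing import Any, Iterator, List

# Hand-written, heavily trimmed stand-in for a Substrait JSON plan (not real output).
SAMPLE = """
{
  "relations": [
    {"root": {"input": {"filter": {"input": {
      "read": {"namedTable": {"names": ["LINEITEM"]}}
    }}}}}
  ]
}
"""


def named_tables(node: Any) -> Iterator[List[str]]:
    """Walk the JSON tree and yield the name parts of every named-table read."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "namedTable" and isinstance(value, dict):
                yield value.get("names", [])
            else:
                yield from named_tables(value)
    elif isinstance(node, list):
        for item in node:
            yield from named_tables(item)


print(list(named_tables(json.loads(SAMPLE))))
```

Running something like this over a real plan file would show whether the plan refers to tables in uppercase, which is relevant to the identifier-normalization confusion discussed above.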


alamb · 2025-01-30T16:53:34Z

Here is my email I sent to @lmwnshn about potential projects

All of these projects would be written in Rust, on a production-grade open source query engine (DataFusion).

Reference: https://dl.acm.org/doi/10.1145/3626246.3653368

There is significant community interest in the features too so if done well I think it is likely there would be community interaction and the code would be accepted.

Implement Sideways Information Passing / Dynamic Filter Pushdown in DataFusion

Ticket: #7955

This project is well documented, but only partly optimizer related

Students would learn:
** Expression representation,
** Extending Database Optimizer rules (pushing predicates + join restrictions)
** Benchmarking,
** Extending physical plans / Join code
** working with open source community (I think there are several people who are interested in helping this along)
** the classic "Database lifestyle" rush of making TPCH queries faster (and wondering if the optimizations apply to other workloads)
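
For anyone new to the idea, here is a toy sketch of what a dynamic filter does in a hash join. It is plain Python with hypothetical function names and only illustrates the technique, not the design discussed in #7955: the build side is read first, the min/max of its join keys is collected, and that range is handed "sideways" to the probe-side scan so non-matching rows are dropped early. A min/max range is just one possible filter; a real system might pass a set or Bloom filter instead.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

Row = Dict[str, int]
Bounds = Optional[Tuple[int, int]]


def build_hash_table(rows: Iterable[Row], key: str) -> Tuple[Dict[int, List[Row]], Bounds]:
    """Build the join hash table and, as a by-product, the key range."""
    table: Dict[int, List[Row]] = {}
    for row in rows:
        table.setdefault(row[key], []).append(row)
    bounds = (min(table), max(table)) if table else None
    return table, bounds


def scan(rows: Iterable[Row], key: str, bounds: Bounds) -> Iterator[Row]:
    """Probe-side scan with the dynamic range filter applied."""
    if bounds is None:
        return  # empty build side: an inner join produces nothing
    lo, hi = bounds
    for row in rows:
        if lo <= row[key] <= hi:
            yield row


def hash_join(build_rows: Iterable[Row], probe_rows: Iterable[Row], key: str) -> List[Row]:
    """Inner hash join whose probe scan is pruned by the build-side key range."""
    table, bounds = build_hash_table(build_rows, key)
    out = []
    for probe in scan(probe_rows, key, bounds):
        for build in table.get(probe[key], []):
            out.append({**build, **probe})
    return out


build = [{"k": 5, "a": 1}, {"k": 7, "a": 2}]
probe = [{"k": i, "b": i * 10} for i in range(10)]
print(hash_join(build, probe, "k"))
```

In this toy case only 3 of the 10 probe rows survive the scan; in a real engine the same range could also be used to skip whole Parquet row groups using their statistics.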

Implement LATERAL JOINs in DataFusion

Ticket: #10048

This one is less well specified, but if a group wants to work on this I can find time to help specify it more.

Students would learn:
** The wonders of subqueries, and a visceral understanding of their relation to joins
** subquery decorrelation / rewrites
** extending optimizer rules
** would need: some additional subquery decorrelation optimizer code (and possibly some physical operator support)
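
As a reference point for the semantics, a LATERAL join can be read as a nested loop in which the right side is re-evaluated for every left row and may refer to that row. The Python sketch below (hypothetical names, toy data) shows only that meaning. It is not how an engine would want to execute it; the decorrelation work mentioned above is about rewriting this pattern into ordinary joins where possible.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]


def lateral_join(left: Iterable[Row], right_for: Callable[[Row], Iterable[Row]]) -> Iterator[Row]:
    """Evaluate the right side once per left row and concatenate the results."""
    for left_row in left:
        for right_row in right_for(left_row):
            yield {**left_row, **right_row}


customers = [{"cust": "a"}, {"cust": "b"}]
orders = [
    {"o_cust": "a", "total": 10},
    {"o_cust": "a", "total": 30},
    {"o_cust": "b", "total": 20},
]


def top_order(customer: Row) -> List[Row]:
    """Correlated subquery: the largest order of this customer (LIMIT 1)."""
    mine = [o for o in orders if o["o_cust"] == customer["cust"]]
    return sorted(mine, key=lambda o: o["total"], reverse=True)[:1]


print(list(lateral_join(customers, top_order)))
```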

Implement Range Joins / ASOF joins

Ticket: #318

This one has had some work and even a prototype initial implementation. However, it needs help to design / explain / evaluate the existing approach.

Students would learn:

What a Range Join is, how it works, and how it could be implemented
How to specify and describe a new feature
How to work with existing code to push the feature through
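
To give a feel for the ASOF variant, here is a minimal sketch in Python: for each left row, pick the latest right row whose key is less than or equal to the left key. It uses sort plus binary search from the standard library, which is one straightforward way to do it and not the approach taken in the existing prototype; `asof_join` and the sample data are hypothetical.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple

Row = Dict[str, float]


def asof_join(left: List[Row], right: List[Row], on: str) -> List[Tuple[Row, Optional[Row]]]:
    """Pair each left row with the latest right row where right[on] <= left[on]."""
    right_sorted = sorted(right, key=lambda r: r[on])
    keys = [r[on] for r in right_sorted]
    out = []
    for row in left:
        i = bisect_right(keys, row[on])  # number of right keys <= row[on]
        out.append((row, right_sorted[i - 1] if i > 0 else None))
    return out


trades = [{"ts": 1, "qty": 5}, {"ts": 4, "qty": 7}, {"ts": 9, "qty": 1}]
quotes = [{"ts": 2, "px": 100.0}, {"ts": 4, "px": 101.0}, {"ts": 8, "px": 99.5}]
for trade, quote in asof_join(trades, quotes, "ts"):
    print(trade, quote)
```

The first trade has no earlier quote, so it is paired with `None`, which corresponds to the left-outer flavor of the join.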

lmwnshn · 2025-01-30T17:30:45Z

On enable_ident_normalization: I think there may be some extra complications in my situation arising from the use of Substrait (or perhaps Substrait table names are treated as quoted? or perhaps I misunderstand the purpose of the flag - entirely possible!).

Please see
https://github.com/lmwnshn/15799-s25-project1-remnants/blob/main/run_datafusion_ident.py#L37
Even with enable_ident_normalization=false, lowercase table names and lowercase column names in the parquet files, as shown here, result in Exception: DataFusion error: Plan("No table named 'LINEITEM'"):
https://github.com/lmwnshn/15799-s25-project1-remnants/blob/main/run_datafusion_ident.py#L12-L19
So I hacked the parquet files up a bit
https://github.com/lmwnshn/15799-s25-project1-remnants/blob/main/fix_parquet.py#L11-L23
and switched to uppercased column names + register tables as uppercase
https://github.com/lmwnshn/15799-s25-project1-remnants/blob/main/run_datafusion.py#L12-L19
to get the Substrait plan to execute successfully.

ozankabak · 2025-02-02T06:49:57Z

I will think about some optimizer-focused projects and circle back next week. @alamb, IMHO the tickets you mention are partly optimizer related, but probably more so about the internals of the operators involved. I think we can find more optimizer-focused projects. There is so much to do there 🚀

alamb · 2025-02-02T12:33:54Z

> I think we can find more optimizer-focused projects. There is so much to do there 🚀

That would be awesome -- I am sure @lmwnshn would appreciate any other suggestions

2010YOUY01 · 2025-02-07T08:23:45Z

I think aggregate pushdown is a potential optimizer-related project #8699

alamb · 2025-02-11T12:39:34Z

I filed Make it easier to run TPCH queries with datafusion-cli #14608 to track the idea of making it easier to run tpch in datafusion-cli

alamb mentioned this issue Jan 30, 2025

Jan 18, 2025: This week(s) in DataFusion #14179

Closed

Omega359 mentioned this issue Feb 4, 2025

Project Ideas for GSoC 2025 (Google Summer of Code) #14478

Open

alamb mentioned this issue Feb 4, 2025

Feb 4, 2025: This week(s) in DataFusion #14491

Open

alamb mentioned this issue Feb 11, 2025

Make it easier to run TPCH queries with datafusion-cli #14608

Open
