Update quarterly roadmap for Q2 #2133

Merged
merged 4 commits into from
Apr 4, 2022
Changes from 3 commits
79 changes: 46 additions & 33 deletions docs/source/specification/quarterly_roadmap.md
@@ -21,52 +21,65 @@

A quarterly roadmap will be published to give the DataFusion community visibility into the priorities of the project's contributors. This roadmap is not binding.

## 2022 Q1
## 2022 Q2

### DataFusion Core

- Publish official Arrow2 branch
- Implementation of memory manager (i.e. to enable spilling to disk as needed)
- IO Improvements
Contributor:

FYI @tustvold

Contributor (@tustvold, Apr 2, 2022):

Not entirely sure what this specifically is referring to, but I definitely intend to focus on improving the IO and scheduling stories in arrow-rs and DataFusion. See apache/arrow-rs#1473 and #2079. Not sure if we want to explicitly call out the scheduling side of this.

I may also get to proper filter pushdown to parquet if I have time - apache/arrow-rs#1191

Edit: I've proposed a change with a very high-level statement of what I hope to achieve w.r.t. scheduling.

Contributor Author:

Thanks @tustvold. I plan on finishing the work summarized in #1777, which is what that item refers to.

- Reading, registering, and writing more file formats from both DataFrame API and SQL
- Additional options for IO including partitioning and metadata support
- Memory Management
Contributor (@tustvold, Apr 2, 2022):

Suggested change
- Memory Management
- Work Scheduling
- Improve predictability, observability and performance of IO and CPU-bound work
- Develop a more explicit story for managing parallelism during plan execution
- Memory Management

I've yet to create a ticket for this, as I'm still exploring the problem domain, but the precursor discussions can be found in apache/arrow-rs#1473 and #2079.

- Add more operators for memory limited execution
- Performance
- Incorporate row-format into operators such as aggregate
- Add row-format benchmarks
Contributor:

Suggested change
- Add row-format benchmarks
- Add row-format benchmarks
- Explore JIT-compiling complex expressions

- Explore LLVM for JIT, with inline Rust functions as the primary goal
- Documentation
Contributor:

Suggested change
- Documentation
- Improve performance of Sort and Merge using Row Format / JIT expressions
- Documentation

Contributor:

I hope to contribute improvements to the Sort performance (especially for multi-column sorts that include strings) this quarter as well. I don't have a writeup of that yet.

- General improvements to DataFusion website
- Publish design documents
- Streaming
- Create `StreamProvider` trait

### Benchmarking
### Ballista

- Inclusion in Db-Benchmark with all queries covered
- All TPCH queries covered
- Make production ready
- Shuffle file cleanup
- Fill functional gaps between DataFusion and Ballista
- Improve task scheduling and data exchange efficiency
- Better error handling
- Task failure
- Executor lost
- Schedule restart
- Improve monitoring and logging
- Auto scaling support
- Support for multi-scheduler deployments. Initially for resiliency and fault tolerance but ultimately to support sharding for scalability and more efficient caching.
- Executor deployment grouping based on resource allocation

### Performance Improvements
### Extensions ([datafusion-contrib](https://github.com/datafusion-contrib))

- Predicate evaluation
- Improve multi-column comparisons (that can't be vectorized at the moment)
- Null constant support
#### [DataFusion-Python](https://github.com/datafusion-contrib/datafusion-python)

### New Features
- Add missing functionality to DataFrame and SessionContext
- Improve documentation

- Read JSON as table
- Simplify DDL with DataFusion-Cli
- Add Decimal128 data type and the attendant features such as Arrow Kernel and UDF support
- Add new experimental e-graph based optimizer
#### [DataFusion-S3](https://github.com/datafusion-contrib/datafusion-objectstore-s3)

### Ballista
- Create Python bindings to use with datafusion-python

- Begin work on design documents and plan / priorities for development

### Extensions ([datafusion-contrib](https://github.com/datafusion-contrib))
#### [DataFusion-Tui](https://github.com/datafusion-contrib/datafusion-tui)

- Stable S3 support
- Begin design discussions and prototyping of a stream provider
- Create multiple SQL editors
- Expose more Context and query metadata
- Support new data sources
- BigTable, HDFS, HTTP APIs

## Beyond 2022 Q1
#### [DataFusion-BigTable](https://github.com/datafusion-contrib/datafusion-bigtable)

There is no clear timeline for the below, but community members have expressed interest in working on these topics.
- Python binding to use with datafusion-python
- Timestamp range predicate pushdown
- Multi-threaded partition aware execution
- Production ready Rust SDK

### DataFusion Core

- Custom SQL support
- Split DataFusion into multiple crates
- Push based query execution and code generation

### Ballista
#### [DataFusion-Streams](https://github.com/datafusion-contrib/datafusion-streams)

- Evolve architecture so that it can be deployed in a multi-tenant cloud native environment
- Ensure Ballista is scalable, elastic, and stable for production usage
- Develop distributed ML capabilities
- Create experimental implementation of `StreamProvider` trait