Please read my article How to build a modern distributed compute platform since it is a good introduction to how I think Ballista (and other distributed compute platforms) should work. This article is a work in progress that I update from time to time, as I learn more about this subject, or when I feel motivated to write.
There is also a wiki with a list of interesting reading material.
This project depends on some existing technologies, so it is a good idea to learn a little about those too:
Ballista will extend DataFusion to support distributed query execution of DataFusion queries by providing the following components:
- Serde code to support serialization and deserialization of logical and physical query plans in protocol buffer format (so that full or partial query plans can be sent between processes).
- Executor process implementing the Flight protocol that can receive a query plan and execute it.
- Shuffle write operator that can store the partitioned output of a query in the executor’s memory, or persist to the file system.
- Shuffle read operator than can read shuffle partitions from other executors in the cluster.
- Distributed query planner / scheduler that will start with a DataFusion physical plan and insert shuffle read and write operators as necessary and then execute the query stages.
- Kubernetes/Etcd support so that clusters can easily be created.
We have a Gitter IM room for discussing this project as well as a discord channel.
See the current milestones and issues here. I recommend starting here when contributing because there is a plan in place for delivering useful point solutions along the way as the project heads towards a v1.0 release.
This project uses the standard GitHub Forking Workflow.
See the developer docs for instructions on setting up a local build environment.