Update main DataFusion README #4903

Merged
merged 9 commits into from
Jan 17, 2023
Changes from 3 commits
146 changes: 104 additions & 42 deletions README.md
@@ -21,34 +21,52 @@

<img src="docs/source/_static/images/DataFusion-Logo-Background-White.svg" width="256" alt="logo"/>

DataFusion is an extensible query planning, optimization, and execution framework, written in
Rust, that uses [Apache Arrow](https://arrow.apache.org) as its
DataFusion is a very fast, extensible query engine for building high-quality data-centric systems in
[Rust](https://www.rust-lang.org/), using the [Apache Arrow](https://arrow.apache.org)
in-memory format.

DataFusion offers SQL and DataFrame APIs, excellent [performance](https://benchmark.clickhouse.com/), built-in support for CSV, Parquet, JSON, and Avro, extensive customization, and a great community.

[![Coverage Status](https://codecov.io/gh/apache/arrow-datafusion/rust/branch/master/graph/badge.svg)](https://codecov.io/gh/apache/arrow-datafusion?branch=master)

## Features

- SQL query planner with support for multiple SQL dialects
- DataFrame API
- Parquet, CSV, JSON, and Avro file formats are supported natively. Custom
file formats can be supported by implementing a `TableProvider` trait.
- Supports popular object stores, including AWS S3, Azure Blob
Storage, and Google Cloud Storage. There are extension points for implementing
custom object stores.
- Feature-rich [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/index.html) and [DataFrame API](https://arrow.apache.org/datafusion/user-guide/dataframe.html)
- Blazingly fast, vectorized, multi-threaded, streaming execution engine.
- Native support for Parquet, CSV, JSON, and Avro file formats. Support
for custom file formats and non-file data sources via the `TableProvider` trait.
- Many extension points: user-defined scalar/aggregate/window functions, data sources, SQL,
other query languages, custom plan and execution nodes, optimizer passes, and more.
- Streaming, asynchronous IO directly from popular object stores, including AWS S3,
Azure Blob Storage, and Google Cloud Storage. Other storage systems are supported via the
`ObjectStore` trait.
- [Excellent Documentation](https://docs.rs/datafusion/latest) and a
[welcoming community](https://arrow.apache.org/datafusion/community/communication.html).
- A state-of-the-art query optimizer with projection and filter pushdown, sort-aware optimizations,
automatic join reordering, expression coercion, and more.
- Permissive Apache 2.0 License, Apache Software Foundation governance
- Written in [Rust](https://www.rust-lang.org/), a modern systems language that offers development
productivity similar to Java or Go and the performance of C++, and is
[loved by programmers everywhere](https://insights.stackoverflow.com/survey/2021#technology-most-loved-dreaded-and-wanted).
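
The projection and filter pushdown mentioned in the feature list can be illustrated with a toy sketch. The code below is plain Rust using only the standard library. `Table`, `naive`, `pushed_down`, and `sample_table` are hypothetical names invented for this illustration and are not part of DataFusion's API; it only shows the idea of reading fewer cells for a query of the form `SELECT <columns> FROM t WHERE <column> > <value>`.

```rust
// Toy illustration only: `Table`, `naive`, `pushed_down`, and `sample_table`
// are hypothetical names and are NOT part of DataFusion's API.

/// A tiny columnar table: each entry is (column name, column values).
/// All columns are assumed to have the same length.
pub struct Table {
    pub columns: Vec<(String, Vec<i64>)>,
}

impl Table {
    /// Look up a column by name; panics if the column does not exist.
    fn column(&self, name: &str) -> &[i64] {
        let (_, values) = self
            .columns
            .iter()
            .find(|(n, _)| n.as_str() == name)
            .expect("unknown column");
        values
    }

    fn num_rows(&self) -> usize {
        self.columns.first().map_or(0, |(_, values)| values.len())
    }
}

/// Naive plan: read every column of every row, then filter, then project.
/// Returns the output rows and the number of cells read.
pub fn naive(table: &Table, filter_col: &str, min: i64, project: &[&str]) -> (Vec<Vec<i64>>, usize) {
    let mut cells_read = 0;
    let mut out: Vec<Vec<i64>> = Vec::new();
    for row in 0..table.num_rows() {
        // Materialize the full row, touching one cell in every column.
        let full_row: Vec<(&str, i64)> = table
            .columns
            .iter()
            .map(|(name, values)| (name.as_str(), values[row]))
            .collect();
        cells_read += full_row.len();
        let value_of = |col: &str| -> i64 {
            full_row.iter().find(|(n, _)| *n == col).expect("unknown column").1
        };
        if value_of(filter_col) > min {
            out.push(project.iter().map(|&p| value_of(p)).collect());
        }
    }
    (out, cells_read)
}

/// Pushdown plan: evaluate the predicate on the filter column alone, then
/// read only the projected columns, and only for the matching rows.
pub fn pushed_down(table: &Table, filter_col: &str, min: i64, project: &[&str]) -> (Vec<Vec<i64>>, usize) {
    // Filter pushdown: scan just the filter column.
    let filter_values = table.column(filter_col);
    let mut cells_read = filter_values.len();
    let matching_rows: Vec<usize> = filter_values
        .iter()
        .enumerate()
        .filter(|(_, value)| **value > min)
        .map(|(row, _)| row)
        .collect();
    // Projection pushdown: touch only the requested columns.
    let projected: Vec<&[i64]> = project.iter().map(|&p| table.column(p)).collect();
    let mut out: Vec<Vec<i64>> = Vec::new();
    for &row in &matching_rows {
        cells_read += projected.len();
        out.push(projected.iter().map(|col| col[row]).collect());
    }
    (out, cells_read)
}

/// Sample data: 3 columns, 4 rows.
pub fn sample_table() -> Table {
    Table {
        columns: vec![
            ("id".to_string(), vec![1, 2, 3, 4]),
            ("amount".to_string(), vec![10, 50, 5, 70]),
            ("region".to_string(), vec![7, 8, 9, 7]),
        ],
    }
}
```

Calling `naive(&sample_table(), "amount", 20, &["id"])` and `pushed_down(&sample_table(), "amount", 20, &["id"])` returns the same two rows, but the pushdown variant reads 6 cells where the naive one reads 12.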

## Use Cases

DataFusion is modular in design with many extension points and can be
used without modification as an embedded query engine and can also provide
a foundation for building new systems. Here are some example use cases:
DataFusion can be used without modification as an embedded SQL
engine or can be customized and used as a foundation for
building new systems. Here are some examples of systems built using DataFusion:

- Specialized analytical database systems such as [CeresDB] and more general Spark-like systems such as [Ballista]
- New query language engines such as [prql-query] and accelerators such as [VegaFusion]
- Research platforms for new database systems, such as [Flock]
- SQL support for another library, such as [dask sql]
- Streaming data platforms such as [Synnada]
- Tools for reading / sorting / transcoding Parquet, CSV, Avro, and JSON files such as [qv]
- A faster Spark runtime replacement (blaze-rs)

- DataFusion can be used as a SQL query planner and query optimizer, providing
optimized logical plans that can then be mapped to other execution engines.
- DataFusion is used to create modern, fast and efficient data
pipelines, ETL processes, and database systems, which need the
performance of Rust and Apache Arrow and want to provide their users
the convenience of an SQL interface or a DataFrame API.
By using DataFusion, the projects are freed to focus on their specific
features, and avoid reimplementing general (but still necessary)
features such as an expression representation, standard optimizations,
execution plans, file format support, etc.

## Why DataFusion?

@@ -57,9 +75,31 @@ a foundation for building new systems. Here are some example use cases:
- _Easy to Embed_: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific use case
- _High Quality_: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.

## Comparisons with other projects

Here is a comparison with similar projects that may help you understand
when DataFusion might or might not be suitable for your needs:

- [DuckDB](http://www.duckdb.org) is an open source, in-process analytical database.
Like DataFusion, it supports very fast execution, both from its custom file format
and directly from Parquet files. Unlike DataFusion, it is written in C/C++ and
is primarily used directly by users as a serverless database and query system rather
than as a library for building such database systems.

- [pola.rs](http://pola.rs): Polars is one of the fastest DataFrame libraries at the time
of writing. Like DataFusion, it is also written in Rust, but unlike DataFusion
it does not provide SQL or many extension points.

- [Facebook Velox](https://engineering.fb.com/2022/08/31/open-source/velox/)
is an execution engine. Like DataFusion, Velox aims to
provide a reusable foundation for building database-like systems. Unlike DataFusion,
it is written in C/C++ and does not include a SQL frontend or a planning/optimization
framework.

## DataFusion Community Extensions

There are a number of community projects that extend DataFusion or provide integrations with other systems.
There are a number of community projects that extend DataFusion or
provide integrations with other systems.

### Language Bindings

@@ -78,29 +118,51 @@ There are a number of community projects that extend DataFusion or provide integ

Here are some of the projects known to use DataFusion:

- [Ballista](https://github.com/apache/arrow-ballista) Distributed SQL Query Engine
- [Blaze](https://github.com/blaze-init/blaze) Spark accelerator with DataFusion at its core
- [CeresDB](https://github.com/CeresDB/ceresdb) Distributed Time-Series Database
- [Cloudfuse Buzz](https://github.com/cloudfuse-io/buzz-rust)
- [CnosDB](https://github.com/cnosdb/cnosdb) Open Source Distributed Time Series Database
- [Cube Store](https://github.com/cube-js/cube.js/tree/master/rust)
- [Dask SQL](https://github.com/dask-contrib/dask-sql) Distributed SQL query engine in Python
- [datafusion-tui](https://github.com/datafusion-contrib/datafusion-tui) Text UI for DataFusion
- [delta-rs](https://github.com/delta-io/delta-rs) Native Rust implementation of Delta Lake
- [Flock](https://github.com/flock-lab/flock)
- [Greptime DB](https://github.com/GreptimeTeam/greptimedb) Open Source & Cloud Native Distributed Time Series Database
- [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series Database
- [Parseable](https://github.com/parseablehq/parseable) Log storage and observability platform
- [qv](https://github.com/timvw/qv) Quickly view your data
- [ROAPI](https://github.com/roapi/roapi)
- [Seafowl](https://github.com/splitgraph/seafowl) CDN-friendly analytical database
- [Synnada](https://synnada.ai/) Streaming-first framework for data products
- [Tensorbase](https://github.com/tensorbase/tensorbase)
- [VegaFusion](https://vegafusion.io/) Server-side acceleration for the [Vega](https://vega.github.io/) visualization grammar

(if you know of another project, please submit a PR to add a link!)

## Example Usage
- [Ballista] Distributed SQL Query Engine
- [Blaze] Spark accelerator with DataFusion at its core
- [CeresDB] Distributed Time-Series Database
- [Cloudfuse Buzz]
- [CnosDB] Open Source Distributed Time Series Database
- [Cube Store]
- [Dask SQL] Distributed SQL query engine in Python
- [datafusion-tui] Text UI for DataFusion
- [delta-rs] Native Rust implementation of Delta Lake
- [Flock] Cloud database research system
- [Kamu] Planet-scale streaming data pipeline
- [Greptime DB] Open Source & Cloud Native Distributed Time Series Database
- [InfluxDB IOx] Time Series Database
- [Parseable] Log storage and observability platform
- [qv] Quickly view your data
- [prql-query] Query and transform data with PRQL
- [ROAPI] Automatic read-only APIs for static datasets
- [Seafowl] CDN-friendly analytical database
- [Synnada] Streaming-first framework for data products
- [Tensorbase]
- [VegaFusion] Server-side acceleration for the [Vega](https://vega.github.io/) visualization grammar

[ballista]: https://github.com/apache/arrow-ballista
[blaze]: https://github.com/blaze-init/blaze
[ceresdb]: https://github.com/CeresDB/ceresdb
[cloudfuse buzz]: https://github.com/cloudfuse-io/buzz-rust
[cnosdb]: https://github.com/cnosdb/cnosdb
[cube store]: https://github.com/cube-js/cube.js/tree/master/rust
[dask sql]: https://github.com/dask-contrib/dask-sql
[datafusion-tui]: https://github.com/datafusion-contrib/datafusion-tui
[delta-rs]: https://github.com/delta-io/delta-rs
[flock]: https://github.com/flock-lab/flock
[kamu]: https://github.com/kamu-data/kamu-cli
[greptime db]: https://github.com/GreptimeTeam/greptimedb
[influxdb iox]: https://github.com/influxdata/influxdb_iox
[parseable]: https://github.com/parseablehq/parseable
[prql-query]: https://github.com/prql/prql-query
[qv]: https://github.com/timvw/qv
[roapi]: https://github.com/roapi/roapi
[seafowl]: https://github.com/splitgraph/seafowl
[synnada]: https://synnada.ai/
[tensorbase]: https://github.com/tensorbase/tensorbase
[vegafusion]: https://vegafusion.io/

(if you know of another project, please submit a PR to add a link!)

## Examples

Please see the [example usage](https://arrow.apache.org/datafusion/user-guide/example-usage.html) in the user guide and the [datafusion-examples](https://github.com/apache/arrow-datafusion/tree/master/datafusion-examples) crate for more information on how to use DataFusion.

2 changes: 1 addition & 1 deletion docs/source/user-guide/introduction.md
@@ -26,7 +26,7 @@ in-memory format.
DataFusion supports both an SQL and a DataFrame API for building
logical query plans as well as a query optimizer and execution engine
capable of parallel execution against partitioned data sources (CSV
and Parquet) using threads.
and Parquet) using

## Use Cases
