[Epic] Make DataFusion a reliable foundation for building query engines #12723
Comments
This certainly sounds ambitious (but good!). Another specific change towards this goal that might be worth considering (and that came up in @andygrove's CMU talk about Comet) would be user defined coercion rules. The coercion / type resolution is pretty specific today and can't be easily extended |
Thanks @alamb for your feedback! For #12644 i started to prototype a new type system that DataFusion could adopt, with assumptions that all types should be first-class citizens (so a |
Let's put the exact mechanics (Tasks) aside, would be awesome to have agreement on the two things first:
|
Well, I don't think there is / will be much debate about this goal as it is the same as the current goal, in my mind (or maybe I don't understand if it is meant to be different than the current goal)
I look at it a little differently, which is "what features / extension points are missing to allow more building with DataFusion" -- like in my mind the current architecture already supports this goal, though it has areas that could be improved (e.g. extensible coercion, user defined types, etc) |
I think #12622 is important if we want a nice function signature and coercion rule. |
Thank you @alamb @jayzhan211 for your comments!
I do think this is the current goal.
Not fully. For example, the DF coercion rules are everywhere. The function signatures are consulted at various phases of the planning. Thus if someone wants to build a system without "everything is coercible to string" behavior, the current architecture doesn't allow that.
Yes to both! Plus #12644. But even if we do all of that, we still run on top of Arrow for execution (which is not a bad thing), so we need a layer at which the (simplified) Arrow types are the type system (simplified to allow #12720). The physical plans could be the only such layer, but that would be a missed opportunity to reuse optimizations, because physical plans are too physical already. Thus we need a logical plan layer where engines can integrate. TL;DR |
I think in my mind the goal is to be an extensible query engine and an LLVM for data intensive systems. Perhaps that is subtly different 🤔
Pedantically I would probably phrase this as "the current code doesn't allow that" -- I think that by adding a suitable API it would be straightforward to allow user defined coercion rules. There may be different opinions on if that would be an architectural change, but I would say it isn't. |
It seems to me like there is broad agreement that:
Maybe we can treat these as three different features. At least one good next step is probably to file a ticket describing what "user defined coercion rules" would look like |
I don't want this ticket to be about flexible (aka user defined) coercion rules. I hope there is no disagreement that more code structure, simpler code flow and well defined contracts is a good thing. |
#13028 change is a great supporting example for this initiative. |
from https://datafusion.apache.org/
ibid.
The two passages indicate the dual nature of DataFusion.
First, it's a complete query engine, with which users can interact e.g. using datafusion-cli (or dfdb proposed in #11979) or DataFusion's SQL and DataFrames frontends.
Second is what @alamb often refers to as DataFusion being "LLVM for query engines", a building block for other applications, where components are re-usable.
(See also https://datafusion.apache.org/user-guide/faq.html#how-does-datafusion-compare-with-xyz, https://datafusion.apache.org/contributor-guide/architecture.html and https://docs.rs/datafusion/latest/datafusion/index.html#architecture)
A query engine may implement a dialect of SQL that is identical with DataFusion SQL, or different.
DataFusion doesn't need to know the details of the query engine being implemented (it is extensible rather than being a union of all the query languages). DataFusion needs to allow expressing different query languages, providing a reliable and dialect-agnostic foundation for applications building on top of DataFusion.
A query engine and a query language have certain attributes
Internally, such a query engine transforms user queries (according to syntax, scoping, typing rules) into relational algebra operations, optimizes and executes them. Sounds simple, but in reality this is super complex and this is where DataFusion really shines.
This epic issue is a collection of tasks important for achieving this goal. Its description should be expected to be a living document.
At a high level, it aims at separation of concerns. DataFusion has two roles: being an implementation of DataFusion SQL (a particular SQL dialect), and being a reusable building block. These should be clearly separated so that dialect-specific behavior isn't implicitly inherited by components that need to be dialect-agnostic.
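The separation can be illustrated with a small sketch. It is not DataFusion code: every name in it (`CoreType`, `CoreExpr`, `Dialect`, `plan_eq`, `StrictDialect`) is hypothetical and chosen only for illustration. The assumption is that a frontend owns the coercion policy and emits explicit casts, while the core only accepts comparisons whose operand types already match exactly.

```rust
// Hypothetical sketch: none of these names are DataFusion APIs.

/// A deliberately tiny type set for the dialect-agnostic core.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CoreType {
    Int32,
    Int64,
    Utf8,
    Boolean,
}

/// Core expressions contain only explicit operations; nothing is coerced implicitly.
#[derive(Debug, Clone, PartialEq)]
pub enum CoreExpr {
    Column { name: String, ty: CoreType },
    Cast { input: Box<CoreExpr>, to: CoreType },
    Eq(Box<CoreExpr>, Box<CoreExpr>),
}

impl CoreExpr {
    /// Computes the expression type, rejecting `Eq` whose sides differ.
    pub fn data_type(&self) -> Result<CoreType, String> {
        match self {
            CoreExpr::Column { ty, .. } => Ok(*ty),
            CoreExpr::Cast { input, to } => {
                input.data_type()?; // the input must itself be well-typed
                Ok(*to)
            }
            CoreExpr::Eq(l, r) => {
                let (lt, rt) = (l.data_type()?, r.data_type()?);
                if lt == rt {
                    Ok(CoreType::Boolean)
                } else {
                    Err(format!("core requires exact types in `=`: {:?} vs {:?}", lt, rt))
                }
            }
        }
    }
}

/// Dialect-specific policy, owned by a frontend and never consulted by the core.
pub trait Dialect {
    /// Common type for comparing `l` and `r`, or `None` if the dialect rejects it.
    fn comparison_type(&self, l: CoreType, r: CoreType) -> Option<CoreType>;
}

/// Frontend step: resolve `l = r` with the dialect and emit explicit casts.
pub fn plan_eq(d: &dyn Dialect, l: CoreExpr, r: CoreExpr) -> Result<CoreExpr, String> {
    let (lt, rt) = (l.data_type()?, r.data_type()?);
    let common = d
        .comparison_type(lt, rt)
        .ok_or_else(|| format!("dialect cannot compare {:?} with {:?}", lt, rt))?;
    // Wrap an operand in a cast only when its type differs from the common type.
    let cast = |e: CoreExpr, from: CoreType| {
        if from == common {
            e
        } else {
            CoreExpr::Cast { input: Box::new(e), to: common }
        }
    };
    Ok(CoreExpr::Eq(Box::new(cast(l, lt)), Box::new(cast(r, rt))))
}

/// A strict dialect: integers widen, but nothing is coercible to string.
pub struct StrictDialect;

impl Dialect for StrictDialect {
    fn comparison_type(&self, l: CoreType, r: CoreType) -> Option<CoreType> {
        use CoreType::*;
        match (l, r) {
            (a, b) if a == b => Some(a),
            (Int32, Int64) | (Int64, Int32) => Some(Int64),
            _ => None,
        }
    }
}
```

In this shape, an engine that wants "everything is coercible to string" would supply a different `Dialect` implementation, and the core type check would stay unchanged.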
Goals and Overview
As a result DataFusion should have the following layers
Frontend
DataFusion frontend includes DataFusion SQL - DataFusion's implementation of SQL. DataFusion frontend also includes the DataFrame API.
It provides the following functionality
sqlparser crate that DataFusion SQL uses
DataType) but that should change in [Proposal] Decouple logical from physical types #11513)
DataFusion main
DataFusion "main" or "core" represents a dialect- and syntax-agnostic execution query engine library, for building query engines.
It serves as an API for all query engine implementors that decide to build on top of DataFusion.
It provides the following functionality
DataType ([Proposal] Decouple logical from physical types #11513)
UnwrapCastInComparison for input CAST(a_number AS varchar(4)) > '1234' should check whether a_number can safely be represented in 4 characters (would the cast fail?). It clearly can't fail for e.g. the int8 type, and clearly can fail for the int32 type
a = b types of a and b match exactly)
IS NULL and IS UNDEFINED or have a Wildcard expression, since those things are handled by the frontend layer.
Alias from Expr #1468
DataFusion execution
It provides the following functionality
ExecutionPlan, PhysicalExpr)
DataType ([Proposal] Decouple logical from physical types #11513)
DataType directly is also an option, but would prevent runtime-adaptive execution, see Runtime-adaptive data representation #12720, so simplified types would be strongly preferred, [Proposal] Decouple logical from physical types #11513
Tasks
(Tasks to be added here once they are discovered and defined.)
sqlparser dependency from all crates except datafusion-sql and datafusion-cli (it is OK to use for tests)
optimizer crate
SessionState into "frontend SessionState" and "core SessionState": the layers build on each other, so every layer is concerned with runtime-relevant attributes such as RuntimeEnv, but only the frontend needs to know the function repertoire, for example ([EPIC] Easier extension configuration SessionState / SessionConfig #12550 seems related)
datafusion/core into frontend part and reusable part. The public crate name is datafusion, which is perfect for a ready-to-consume frontend component, so maybe we could introduce core as a datafusion-core crate
UnwrapCastInComparison example above) -- this is clearly vague and needs further specification
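To make the UnwrapCastInComparison example a little more concrete, here is a minimal sketch of the check it describes: can every value of an integer type be rendered in at most N characters? The names `IntType` and `cast_to_varchar_is_infallible` are hypothetical and not part of DataFusion, and the sketch assumes that a cast to varchar(N) fails, rather than truncates, when the decimal rendering is longer than N.

```rust
// Hypothetical helper, not a DataFusion API.

/// Integer types considered by this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IntType {
    Int8,
    Int16,
    Int32,
    Int64,
}

impl IntType {
    /// Inclusive value range, widened to i128 so every bound is representable.
    fn range(self) -> (i128, i128) {
        match self {
            IntType::Int8 => (i8::MIN as i128, i8::MAX as i128),
            IntType::Int16 => (i16::MIN as i128, i16::MAX as i128),
            IntType::Int32 => (i32::MIN as i128, i32::MAX as i128),
            IntType::Int64 => (i64::MIN as i128, i64::MAX as i128),
        }
    }
}

/// Length of the decimal rendering of `v`, including a leading '-' if present.
fn rendered_len(v: i128) -> usize {
    v.to_string().len()
}

/// True if every value of `ty` renders in at most `max_chars` characters.
/// Assumption: the cast errors (does not truncate) on longer renderings.
pub fn cast_to_varchar_is_infallible(ty: IntType, max_chars: usize) -> bool {
    let (min, max) = ty.range();
    // The extreme values have the longest renderings, so checking them suffices.
    rendered_len(min).max(rendered_len(max)) <= max_chars
}
```

With this helper, int8 passes for varchar(4) because its longest rendering, -128, is four characters, while int32 does not. Whether such a check belongs in the frontend or in the dialect-agnostic layer is exactly the kind of question the task leaves open.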