Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Create dfdb, a new CLI different than datafusion-cli with pre-built integrations #11979

Closed
alamb opened this issue Aug 14, 2024 · 30 comments

Comments

@alamb
Copy link
Contributor

alamb commented Aug 14, 2024

TLDR

  • Keep datafusion-cli in the apache/datafusion repo
  • Make a new repo with a new CLI called dfdb (or datafusion-cli++or dfcli) which is purposely designed for running queries against a wide variety of pre-integrated sources

Problem Statement

As of today, datafusion-cli (docs) serves two roles:

  1. A debugging / testing tool for the DataFusion query engine developers
  2. A CLI tool for actually doing useful processing if files (locally and remotely using object store), similar to the duckdb CLI tool

It is really sweet to have a CLI that lets you query a directory of parquet files

DataFusion CLI v41.0.0
>
> select "WatchID", "EventDate", "URL" from hits_partitioned limit 10;
+---------------------+-----------+------------------------------------------------------------------------------------------------------+
| WatchID             | EventDate | URL                                                                                                  |
+---------------------+-----------+------------------------------------------------------------------------------------------------------+
| 6904841588848398438 | 15895     | 687474703a2f2f736d6573686172696b692e72752f6d616e756661637475726572363437                             |
...
| 7551542980199423249 | 15895     | 687474703a2f2f736d6573686172696b692e72752f6d616e756661637475726572363437                             |
+---------------------+-----------+------------------------------------------------------------------------------------------------------+
10 row(s) fetched.
Elapsed 0.059 seconds.

However, similarly to the discussion with have had with datafusion-pytyhon this dual role leads to a tension between keeping the core lean and easier to embed (e.g. fewer dependencies) and making a better CLI experience

Examples of Friction

I have recently seen some PRs that are basically integrations that would make datafusion-cli a better end user tool, but bring more dependencies and complexity to DataFusion. For example

  1. Hugging face Feat: Implement hf:// / "hugging face" integration in datafusion-cli #10792 from @xinlifoobar
  2. FlightSQL: Generic FlightTableFactory with a default FlightSqlDriver #11938 from @ccciudatu

I realize I have been partly responsible for this mess and for that I apologize.

Proposal

I propose resolving this conflict by creating a new repository for the "CLI tool people actually use"

We would keep datafusion-cli as it is, a relatively small and a thin wrapper around the core engine. I don't think we should remove features but we also wouldn't add them (other than what was added to the engine by default)

We would add many new features / capabilitues to this dfdb tool

Examples of new features

There are several obvious examples of integrations that would be super useful for users of a CLI tool but not appropriate for the datafusion repo (due to circular dependencies, for example):

@philippemnoel actually referrs to the lack of built in Apache Iceberg support in his blog about switching to using duckdb. This is sad given all the code to use datafuson and delta exists, there just isn't a pre-integrated binary that shows how to hook it up and it easy to get up and use

Other cool features

There are many other cool features I have dreamed about adding to a CLI that might be more appropriate in a separate repo. Some ideas to inspire:

  1. Local catalog support (imagine if you could store your CREATE EXTERNAL TABLE definitions in a file someere (.open <filename> style)
  2. Local parquet metadata cache (imagine being able to cache the parquet metadata for 100s of files in object store in some sort of persistence format so future queries were fast)
  3. SQL auto completion,
  4. etc.

Open questions

Should the new tool be in the datafusion-contrib organization or the apache organization?

The tradeoffs are that datafusion-contrib could move faster / has less governance overhead, but would also lose the apache community

I personally suggest we start with this tool in the datafusion-contrib organization and if there is interest we can discuss bringing it back to the apache organization.

@edmondop
Copy link
Contributor

I had a related problems a couple of weeks ago, I wanted to distribute a CLI that had some additional udfs bundled with it.

I wonder if we need to set up a plugin system for the CLI. I am happy to make a design proposal about it @alamb

@alamb
Copy link
Contributor Author

alamb commented Aug 14, 2024

I wonder if we need to set up a plugin system for the CLI. I am happy to make a design proposal about it @alamb

That would be awesome

Another thing that comes to mind might be for you to fork this datafusion-cli code and bundle your UDFs with it.

Or we could publish most of the code for this CLI to crates.io with the idea that people could make their own CLI mashups with a little bit of configuration setup. The ideas are endless!

@matthewmturner
Copy link
Contributor

matthewmturner commented Aug 14, 2024

Love this idea. I had been working on something similar in the past but unfortunately life got in the way so wasnt able to push it as far as I wanted.

I am currently building a new terminal app with DataFusion at the core, but it is more domain specific than general purpose like I think the new proposed tool is expected to be (and if i wasnt working on that, I would probably volunteer to push this forward). That being said, and depending on the direction of this new tool, there could be some potential overlap / common functionality with what im working on. If thats the case then I would be very happy to collaborate on those pieces. I hope to open source what I'm working on in the fall around the time of v0.1 release. I'll keep my eye on this as it progresses though.

@milenkovicm
Copy link
Contributor

Making generic plugin system would be great, but may be a bit of a challenge with lack of stable rust ABI.

Making some kind of CliContextBuilder where all udfs, table factories ... could be registered may be easier to do.
Maybe existing CliSessionContext could be extended adding one liner methods for most common extensions

@andygrove
Copy link
Member

I think this is a great idea.

@westonpace
Copy link
Member

I don't know if others are interested / if this matches the need but it would be cool to make a "CLI-driven query engine frontend" that isn't coupled to any particular query engine (e.g. it could send queries as either SQL or substrait, potentially via something like adbc).

In other words, it could be something like squirrel / dbeaver for analytics.

@phillipleblanc
Copy link
Contributor

I think this is a great idea. Happy to do the work to integrate https://github.com/datafusion-contrib/datafusion-table-providers into it, I think it would be a cool use-case to have.

@edmondop
Copy link
Contributor

edmondop commented Aug 15, 2024

Will do some homework. Maybe we only need a datafusion-cli repo that one can fork stand-alone to create a proprietary CLI distribution and the plugin system is effectively an overkill

How should we go about dfdb? Do we think we should have "flavors", maybe like the following?

  • dfdb-minimal (maybe datafusion-cli ?)
  • dfdb-standard (whatever that means?)
  • dfdb-full (whatever that means?)

Or should we aim at having a single "flavor" only?

@jayzhan211
Copy link
Contributor

I don't know if others are interested / if this matches the need but it would be cool to make a "CLI-driven query engine frontend" that isn't coupled to any particular query engine (e.g. it could send queries as either SQL or substrait, potentially via something like adbc).

In other words, it could be something like squirrel / dbeaver for analytics.

I like the idea to have a cli frontend that is query engine agnostic (datafusion, duckdb), table agnostic (iceberg, delta), file agnostic (parquet, lance), and even more.

@alamb
Copy link
Contributor Author

alamb commented Aug 15, 2024

I like the idea to have a cli frontend that is query engine agnostic (datafusion, duckdb), table agnostic (iceberg, delta), file agnostic (parquet, lance), and even more.

This sounds like "Ibis" for CLIs and I think is a neat idea

However, I am personally not likely to work on such a thing -- I am far more interested in showing off / using the DataFusion execution engine more broadly, in part as I think it will drive more use and thus more contributions back to DataFusion

So therefore I am not opposed to creating some sort of pluggable backend query engine system, but I think I would like to focus on the DataFusion one.

I am thinking what I will try to do if no one beats me to it is to sketch out what a dfdb repo might look like

  1. basically copy the datafusion-cli's code / docs into its own repo
  2. polish up the code / APIs perhaps to make it clearer how to register custom functions, table providers, etc

@samuelcolvin
Copy link
Contributor

samuelcolvin commented Aug 15, 2024

I like the idea to have a cli frontend that is query engine agnostic (datafusion, duckdb), table agnostic (iceberg, delta), file agnostic (parquet, lance), and even more.

Surely this should just be a CLI that speaks arrow flight SQL? I've tried to persuade ClickHouse to adopt arrow flight SQL to get around their woeful Python client.

That sounds interesting, but I agree with @alamb that that's a different question.


I like the broad idea here, while I like datafusion-cli, it definitely feels incomplete.

Where I disagree (I think) with @alamb is around what it lacks — my main problem with datafusion-cli is the "UI" — all the small parts of the best CLIs which make the difference:

  • syntax highlighting (like duckdb)
  • auto-complete - aka a language server for datafusion (like pgcli)
  • good multiline vs single line support (like pgcli)
  • good error summaries e.g. point are the part of the statement that's wrong (like ipython)
  • pretty tables (like duckdb or rich, not "ugly" ascii dashes), while we at it also support for more display formats like markdown
  • convienient CSV and JSON export (like duckdb) — maybe datafusion-cli already does this, I haven't tried
  • convient way to display rows as "details" when they're too long to show as tables (like clickhouse-cli)
  • proper terminal interface for horizontal scrolling (I think rich/textual has good support for this)
  • good introspection shortchuts (like pgcli) - e.g. \d or \dt
  • support for running multiple queries from a file (like psql I think)
  • (minor) good support for timing queries, on by default
  • JSON query support, I guess since Rust has a poor story when it comes to dynamic linking, just bundling datafusion-functions-json would be good enough

There's one more thing that I've wanted from a database CLI many times and nothing (AFAIK) supports:

  • Plotting right from the CLI - I guess the obvious solution is terminal ploting like this (more here), you could could also imagine a plot command opening a plot in a webbrowser for easier interaction

Anyway, that's my wish list, as you might guess I've thought about this a fair bit!


One last thing I'll say — virtually all of the above features would be independent of the query engine being used, so there is a good argument to build one great CLI user experience (or let someone else build it) and make it pluggable into any query engine that speaks Arrow Flight SQL, whether that be over a network, or just as an API within a single binary or between dynamically linked libaries.

@findepi
Copy link
Member

findepi commented Aug 15, 2024

One last thing I'll say — virtually all of the above features would be independent of the query engine being used, so there is a good argument to build one great CLI user experience (or let someone else build it)

this seems to be the goal of the https://github.com/xo/usql project.
I am not saying DF CLI shouldn't try to address those goals, just acknowledging it's both ambitions and potentially not central to the project (DF considers itself a query engine).

@alamb
Copy link
Contributor Author

alamb commented Aug 15, 2024

Where I disagree (I think) with @alamb is around what it lacks — my main problem with datafusion-cli is the "UI" — all the small parts of the best CLIs which make the difference:

indeed, if the goal is to make a duckdb replacement this would be a pretty great list for someone to work on

However, as my personal goal is not (at least not yet 😆 ) to make a duckdb replacement, so I am not likely going to able to focus on this list. I will definitely steal it as "wishlist for dfdb" however 🚀

@zeroshade
Copy link
Member

Surely this should just be a CLI that speaks arrow flight SQL? I've tried to persuade ClickHouse to adopt arrow flight SQL to get around their woeful Python client.

Another possibility might be to leverage ADBC (which has a flight SQL driver). DuckDB implements the ADBC interface, so between that and the other available drivers for ADBC, it might be more beneficial than just using flight SQL?

@korowa
Copy link
Contributor

korowa commented Aug 21, 2024

Love this idea. I had been working on something similar in the past but unfortunately life got in the way so wasnt able to push it as far as I wanted.

Just want to highlight the existing tool authored by @matthewmturner one more time -- perhaps there is no need for another terminal app as the existing one (datafusion-tui) seems to fit the goal (independent contrib app with the focus on the UI / additional plugins), and it just needs more attention/efforts.

@alamb
Copy link
Contributor Author

alamb commented Aug 21, 2024

Just want to highlight the existing tool authored by @matthewmturner one more time -- perhaps there is no need for another terminal app as the existing one (datafusion-tui) seems to fit the goal (independent contrib app with the focus on the UI / additional plugins), and it just needs more attention/efforts.

I agree -- @matthewmturner what do you think about rekindling the 🔥 around dft (the name is sufficiently short that I like it!)? https://github.com/datafusion-contrib/datafusion-tui

I am going to be on vacation next week so may have some time to play around with this idea -- perhaps I can make some PRs to that repo?

@matthewmturner
Copy link
Contributor

I would definitely be happy to have work pick back up there. I do believe that it's original goal is generally aligned with the above comments / wish list.

Based on my current work on the other TUI I am working on, which has similar functionality, I think theres some cleanup that will need to be done (updating dependencies and perhaps refactoring the event loop / handler ). I would be happy to drive the efforts on that to get everything back up to date (as well as cleanup the issue backlog).

Separately, I have some code in my current project that helps converting record batches to a Table format that ratatui understands and I think it would also be helpful here. I had plans to eventually turn this into a sub crate but it probably makes more sense in datafusion-contrib - perhaps in a new repo like datafusion-tui-tools so that other projects which want to build TUIs with datafusion can benefit (i guess right now thats just my separate project and dft but maybe this will inspire / enable more domain specific ones which would be cool).

One other note, I had initially made dft compatible with both datafusion and ballista - I think we'll need to decide whether thats still needed. perhaps having a nice client for ballista could help drive more interest in it? @andygrove maybe you have thoughts.

@alamb it would certainly be good to get your view on the current state of the project / any PRs / new issues. An approach that might be practical in the short run is if I focus on getting the core app structure / dependencies updated and to the extent you or anyone else has time if we could get datafusion / arrow / object store updated (its still on the old datafusion-contrib object store).

I'm currently on vacation but would be able to start on this next week if this is ultimately the direction that we decide to go.

@alamb
Copy link
Contributor Author

alamb commented Aug 22, 2024

@alamb it would certainly be good to get your view on the current state of the project / any PRs / new issues. An approach that might be practical in the short run is if I focus on getting the core app structure / dependencies updated and to the extent you or anyone else has time if we could get datafusion / arrow / object store updated (its still on the old datafusion-contrib object store).

Sounds good - I'll try and update the dependencies as a way to get familiar with the project. This is going to be great

I spent some time poking around with it and it has some quite cool UI -- I think what I am most passionate about is setting up the integrations with iceberg-rs, s3, delta-rs, etc. I am not very good at UIs / interactions

As long as dft can be used to script things (as in run a file of sql commands) I think it will be good for me.

@matthewmturner
Copy link
Contributor

As long as dft can be used to script things (as in run a file of sql commands) I think it will be good for me.

Yes, it can already do that.

I am definitely also passionate about the integrations point (i had some ideas for that in mind from the beginning such as delta, flightsql, and excel.

@backkem
Copy link
Contributor

backkem commented Aug 23, 2024

I think this is a great idea. Happy to do the work to integrate https://github.com/datafusion-contrib/datafusion-table-providers into it, I think it would be a cool use-case to have.

I'd like to highlight the idea of having table providers in datafusion-table-providers and integrating them in datafusion-tui from there. This way the providers are also readily available for those embedding DataFusion as a library, which aligns closely with its intended use.

@alamb
Copy link
Contributor Author

alamb commented Aug 23, 2024

I'd like to highlight the idea of having table providers in datafusion-table-providers and integrating them in datafusion-tui from there. This way the providers are also readily available for those embedding DataFusion as a library, which aligns closely with its intended use.

I agree this sounds like the ideal setup to me -- having them in a separate repo would help keep the boundaries clear as well

@matthewmturner
Copy link
Contributor

@alamb When did you plan on starting to work on this? On my flight home i managed to get a good chunk through a clean rewrite leveraging the setup from my other app which is a much better setup for moving forward. I still have some more work to do but maybe by end of weekend it would be done.

I also may try to pause my other work for a few weeks to focus on dft to see if it can get in a decent shape for the upcoming NYC meetup.

@matthewmturner
Copy link
Contributor

This is the branch with the rewrite.

@alamb
Copy link
Contributor Author

alamb commented Aug 25, 2024

@alamb When did you plan on starting to work on this?

I keep telling myself "tomorrow" but then I end up getting carried away reviewing all the other good stuff going on (eg.. #12044 #12095 etc 🤣 )

THis week I am on vacation, so I have a bit more time for some fun projects (at least that is what I am telling myself)

On my flight home i managed to get a good chunk through a clean rewrite leveraging the setup from my other app which is a much better setup for moving forward. I still have some more work to do but maybe by end of weekend it would be done.

Nice! I'll go check it out now

@matthewmturner
Copy link
Contributor

@alamb i still need to add back the query history / context info and i am improving the ergonomics of navigating query results now.

@matthewmturner
Copy link
Contributor

For anyone interested, i have been on a bit of a fast and furious dev sprint making updates to datafusion-tui. ive basically done a clean rewrite to modernize it and ive gotten the following features to work

  • Local Datafusion context querying
  • FlightSQL querying (just to localhost:50051 but it will be configurable soon)
  • Register multiple custom s3 object stores
  • Register deltalake tables with CREATE EXTERNAL TABLE table STORED AS DELTATABLE
  • View logs
  • See session context information

I plan on adding one more feature (new tab for storing query execution results / stats) and then will pause feature development to focus on cleanup / usability improvements / testing / docs / etc.

If anyone wants to help test it would be appreciated else i just keep plugging away :)

if you want all those features you can run cargo r --features=flightsql,s3,deltalake

@samuelcolvin
Copy link
Contributor

Somewhat related to a great CLI experience, "leading FROM" duckdb syntax would make any autocomplete much more useful — apache/datafusion-sqlparser-rs#1400.

@matthewmturner
Copy link
Contributor

@samuelcolvin autocomplete and syntax highlighting are features i would like to support - but i dont expect to get around to them in the short term

@alamb
Copy link
Contributor Author

alamb commented Sep 20, 2024

I think dft https://github.com/datafusion-contrib/datafusion-dft seems to be serving this purpose well

@alamb
Copy link
Contributor Author

alamb commented Nov 5, 2024

I think dft is pretty close, so I am claiming this is done

@alamb alamb closed this as completed Nov 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests