Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generic FlightTableFactory with a default FlightSqlDriver #11938

Closed
wants to merge 3 commits into from

Conversation

ccciudatu
Copy link
Contributor

@ccciudatu ccciudatu commented Aug 11, 2024

Which issue does this PR close?

Closes #7731

Rationale for this change

Create external tables backed by Arrow Flight(SQL) services.

What changes are included in this PR?

  • A generic FlightTableFactory that can integrate any Arrow Flight RPC service as a TableProviderFactory, expecting a FlightDriver implementation to handle the GetFlightInfo flow.
  • Default FlightSqlDriver implementation registered as FLIGHT_SQL.

Are these changes tested?

yes

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Aug 11, 2024
@ccciudatu ccciudatu force-pushed the flight-data-source branch 10 times, most recently from cc06778 to c38db1c Compare August 12, 2024 08:54
@ccciudatu ccciudatu force-pushed the flight-data-source branch 2 times, most recently from 69d6239 to 820021b Compare August 12, 2024 20:32
@alamb alamb changed the title Generic Arrow Flight RPC data source with a Flight SQL adapter [datafusion-cli] Generic Arrow Flight RPC data source with a Flight SQL adapter Aug 12, 2024
@alamb
Copy link
Contributor

alamb commented Aug 12, 2024

Thank you @ccciudatu

I have been thinking recently that we need to figure out what to do with datafusion-cli -- it seems to currently be serving two roles:

  1. Debug tool for playing around with supported sql, making reproducers
  2. An actual CLI tool with pre-built integrations (I am thinking specifically of Feat: Implement hf:// / "hugging face" integration in datafusion-cli #10792 from @xinlifoobar here)

There is a tension for adding integrations into datafusion-cli as it makes the overall project that much more complicated

@stuartcarnie and I were thinking of writing up a proposal to make a dfdb, or dfcli binary where we could start adding all these pre-built integrations. I think that would be quite compelling

I plan to try to write this up / maybe try and make an example repo or something

@ccciudatu
Copy link
Contributor Author

ccciudatu commented Aug 13, 2024

@alamb The [datafusion-cli] commit was just a follow-up to make the new FLIGHT_SQL table "storage" type available in datafusion-cli (i.e. register the new config namespace). I'm sorry if that was misleading. The patch is supposed to add support for Arrow Flight (SQL) data sources.
However, I'm glad this triggered such an interesting high-level discussion about revisiting the cli package. :)

I will squash the whole thing in a single commit and restore the original title .
Also, I just realized that I need to add proto serialization for the new FlightExec physical plan, so I'll convert this the PR to a draft for now.

@ccciudatu ccciudatu marked this pull request as draft August 13, 2024 13:40
@github-actions github-actions bot added the proto Related to proto crate label Aug 14, 2024
@ccciudatu ccciudatu changed the title [datafusion-cli] Generic Arrow Flight RPC data source with a Flight SQL adapter Generic FlightTableFactory with a default FlightSqlDriver Aug 14, 2024
@ccciudatu ccciudatu marked this pull request as ready for review August 14, 2024 01:12
@ccciudatu
Copy link
Contributor Author

I added the proto marshalling and marked the PR "ready for review", once again.

@alamb
Copy link
Contributor

alamb commented Aug 14, 2024

@alamb The [datafusion-cli] commit was just a follow-up to make the new FLIGHT_SQL table "storage" type available in datafusion-cli (i.e. register the new config namespace). I'm sorry if that was misleading. The patch is supposed to add support for Arrow Flight (SQL) data sources.

FWI I wrote up my thoughts here; #11979

I hope to review this PR later today

One thing that might be useful to ask is "does this code need to be in the datafusion repo, or could it be put in another repo -- like for example https://github.com/datafusion-contrib/datafusion-table-providers 🤔

We could make you your own repo if you wanted

@ccciudatu
Copy link
Contributor Author

@alamb Got it, makes sense.
My initial thought was that having this in the main repo would enable a better overall integration (i.e. datafusion-cli support, transparent proto serialization, etc.) but I guess having this in a contrib project should not limit these options. I'll have a better look at the project you linked to see if that's a better fit, but I'll go with whatever you decide if you come to a conclusion.

Separately, I promise I'll stop force-pushing updates now that I know it's being reviewed. I may push some fixups though... :)
The last change was to account for server implementations that do not allow reusing the same FlightInfo object, so now GetFlightInfo is called before every query against a "flight table", rather than only once on table creation.
Also, I found a weird/buggy behaviour related to some schema mismatch, i.e. columns seem to get reordered in certain scenarios. I will try to reproduce that in a test and follow up with the fix or at least the failing test to reproduce it.

@ccciudatu
Copy link
Contributor Author

I found my embarrassing mistake and added a fixup commit.

// The bearer token has to be passed to the clients that perform
// the DoGet operation, since Dremio, Ballista and possibly others
// expect the bearer token they produce with the handshake response
// to be set on all subsequent requests, including DoGet.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: Remove this "stolen" client implementation and submit a PR to arrow-rs to make the token visible in the official client.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb on average, my PRs are of reasonable size. :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate it @ccciudatu -- I am sorry I haven't had a chance to comment more thoroughly on this PR yet...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries, @alamb! The fact that you are even considering reviewing such a huge patch makes you one of the nicest people I know. 🙃

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is nice of you to say -- the whole point of doing this in open source is to get other people to contribute things and a key way to do that is to actually look at their proposals 😆

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all, thank you @ccciudatu for this contribution. It si

TLDR is I recommend we put this code into a separate repo and indeed have made one here: https://github.com/datafusion-contrib/datafusion-table-provider-flight

The rationale is that since this code uses publically available APIs so there is no technical reason it must be in the DataFusion core crate that I can see

You raise an excellent point about a pre-integrated binary being easier to use with datafusion-cli, and I agree such a pre-integration would be useful to users

However, given how much is already in the DataFusion core crates I think adding more features at this time is likely not a great idea.

Instead, as I wrote down, I think it would be ideal if this provider (along with others) could be pre-integrated into a single binary via something like #11979

I left some suggestions along the way in case it is helpful to you

/// Collects and reports back config entries. Only used to persuade `datafusion-cli`
/// to accept the `flight.` prefix for `CREATE EXTERNAL TABLE` options.
#[derive(Default, Debug, Clone)]
pub struct FlightOptions {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The configuration system is another thing I think that a focused standalone binary would be able to do better than DataFusion's key=value system

/// .insert("CUSTOM_FLIGHT".into(), Arc::new(FlightTableFactory::new(
/// Arc::new(CustomFlightDriver::default())
/// )));
/// let _ = ctx.sql(r#"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is really cool

impl FlightMetadata {
/// Provide custom [PlanProperties] to account for service specifics,
/// such as known partitioning scheme, unbounded execution mode etc.
pub fn new(info: FlightInfo, props: PlanProperties, grpc: MetadataMap) -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is really cool -- we actually implemented some version of this custom for InfluxDB 3.0 for services that exchange data between themselves


use crate::datasource::flight::{FlightDriver, FlightMetadata};

/// Default Flight SQL driver. Requires a `flight.sql.query` to be passed as a table option.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the configuration story for coinfiguring datasources like this is sorely lacking in datafusion-cli

It would be kind of cool to support something like

create crednetials for flightsql as ....

select * from 'flightsql://myhost/SELECT%20*%20FROM%20foo'

@ccciudatu
Copy link
Contributor Author

Thank you, @alamb . I will move this to the new repo and close it from here.
It can't really be moved as such, as the proto stuff is cumbersome to replicate outside of this repo, so I think I'll switch to a serde-based codec for the new physical plan node.
The datafusion-cli integration was never a priority for me (I just thought it was the civilised way to add new table factories), so I have no problem living without it for now.

@alamb
Copy link
Contributor

alamb commented Aug 19, 2024

It can't really be moved as such, as the proto stuff is cumbersome to replicate outside of this repo, so I think I'll switch to a serde-based codec for the new physical plan node.

Makes sense -- thank you for your understanding

@ccciudatu
Copy link
Contributor Author

Moved to its own repo: datafusion-contrib/datafusion-table-provider-flight#1
Thanks, @alamb !

@ccciudatu ccciudatu closed this Aug 19, 2024
@phillipleblanc
Copy link
Contributor

@ccciudatu Just seeing this now - I'm one of the maintainers of https://github.com/datafusion-contrib/datafusion-table-providers. I think it would be useful to have a single crate of community maintained TableProviders and I think this would fit nicely in there, and I could make you a maintainer on that repo as well.

We've also done work to make the table providers in there play nicely with https://github.com/datafusion-contrib/datafusion-federation, which could also be good for this implementation. Also happy if you want to keep it in a separate crate/repo, we can link to it from README.md.

@ccciudatu
Copy link
Contributor Author

You are right, @phillipleblanc, I enjoyed having my own repo for a while, but I soon started to feel lonely. :)
I'm glad you brought this up! I'll try to follow up with a PR to make this part of datafusion-table-providers, where it most likely belongs.
As a side note, I only published it here initially because I noticed that #7731 was open and acknowledged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate proto Related to proto crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Arrow Flight SQL / ADBC datasource
3 participants