Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: integrate unity catalog with datafusion #1338

Merged
merged 28 commits into from
Sep 20, 2023
Merged

Conversation

roeap
Copy link
Collaborator

@roeap roeap commented May 6, 2023

Description

This is a follow-up from #1331, where we added support for the unity catalog and does a couple of things:

  • Adds a port of the client implementation from object_store to communicate with unity catalog.
  • refactors and extends the current client to accept configuration much in the way we do for the storage backends
  • adds additional (so far only Azure) credentials to authorize requests
  • adds an integration with datafusion's SchemaProvider and CatalogProvider
  • add unit tests for various components.

At first glance porting the whole client implementation may seem excessive, but my hope is that with this we can eventually replace the rusoto dependency from the glue catalog, as well as implement the LogStore locking mechanisms suggested by @nkarpov while keeping everything SDK-free :).

@nohajc - unfortunately I do not have access to a unity catalog instance right now. Would it be possible for you to validate, that the requests are still working?

Related Issue(s)

No issues

@github-actions github-actions bot added binding/python Issues for the Python package binding/rust Issues for the Rust crate rust labels May 6, 2023
@nohajc
Copy link
Contributor

nohajc commented May 6, 2023

@nohajc - unfortunately I do not have access to a unity catalog instance right now. Would it be possible for you to validate, that the requests are still working?

Sure, I will gladly test it.

@roeap roeap force-pushed the unity-catalog branch from 29f2b9d to 086d9e9 Compare May 6, 2023 18:08
pub const AZURE_PUBLIC_CLOUD: &str = "https://login.microsoftonline.com";
}

/// Trait for providing authrization tokens for catalog requests
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in authrization

Comment on lines 439 to 579
.map(|c| c.as_ref().to_string())
.unwrap_or("main".into()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to do this instead:

Suggested change
.map(|c| c.as_ref().to_string())
.unwrap_or("main".into()),
.as_ref()
.map(|c| c.as_ref())
.unwrap_or("main"),

Success(Table),
/// Error response
Error(ErrorResponse),
}
Copy link
Contributor

@nohajc nohajc May 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested it and unfortunately Error(ErrorResponse) is not equivalent to Error { error_code: String, message: String }, so right now it fails to parse (the same for Success(Table).

So far I haven't found any other way other than defining the structs inline, like it was before.

I thought the following might work but it didn't.

#[derive(Deserialize)]
#[serde(untagged)]
pub enum GetTableResponse {
    /// Successful response
    Success {
        /// Table information
        #[serde(flatten)]
        table: Table,
    },
    /// Error response
    Error {
        /// Error information
        #[serde(flatten)]
        err: ErrorResponse,
    },
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm .. if we have to we'll do it, but this makes the types a bit annoying to work with.

The example responses from the docs parse fine in the tests though. would it be possible to provide a response body that doe snot parse? Admittedly, the error case I just inferred from your code :).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, now I'm even more confused. From the test, it looks like it should work. I don't know what serde is doing but it didn't parse the real response before I changed the definition back.

I'll try to write a test case to reproduce the problem.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, the error cause was different than I thought (see the comment in struct Table). At least the definitions don't have to be inlined.

}
"access_token" | "unity_access_token" | "databricks_access_token" => {
Ok(UnityCatalogConfigKey::AccessToken)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this should also match AZURE_TENANT_ID, AZURE_CLIENT_ID and AZURE_CLIENT_SECRET, right?
Btw, I wouldn't use the unity_ alias for Azure tenant and client because it's not Databrick/Unity specific.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have been going back and forth on this and hope to experiment a bit here.

The main nuisance I encountered was that in scenarios where you are juggling many connections sometime we pick up things from the environment that aren't intended.

Generally speaking though yes, these are quite common and users would likely expect this to be picked up.

pub updated_by: String,

/// List of schemes whose objects can be referenced without qualification.
pub sql_path: String,
Copy link
Contributor

@nohajc nohajc May 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the culprit. My Unity Catalog returns response without any sql_path.

Suggested change
pub sql_path: String,
#[serde(default)]
pub sql_path: String,

I fixed it like this but looking at the documentation, it seems like most of the response fields could be optional (even though it doesn't make much sense in some cases).

"created_by": "string",
"name": "table_name",
"updated_by": "string",
"sql_path": "string",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add a test case with omitted sql_path but who knows which other fields can be missing from the response under different circumstances...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we can just make a bunch of these fields optional. the ones we actually need are quite limited and should always be available - like name etc..

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WE now just derive default for Schema and Table

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rechecked with the latest changes and deriving Default for the struct does not change the deserializer behavior.
The #[serde(default)] macro still needs to be used either per field or on the whole struct making all the fields optional with empty default. An alternative would be to use Option<String> where it makes sense.

@nohajc
Copy link
Contributor

nohajc commented May 7, 2023

@roeap So far I only verified it works with Databricks personal access token but if I figure out which AAD roles are needed, I could check the OAuth flows (+ azure cli) as well.

@roeap
Copy link
Collaborator Author

roeap commented May 7, 2023

So far I only verified it works with Databricks personal access token but if I figure out which AAD roles are needed, I could check the OAuth flows (+ azure cli) as well.

That would be awesome. Haven't worked with this yet. but from the docs i gather you don't really need specific roles, but rather "Just" need the AAD object user to be available / assigned in the unity catalog.

@jordandakota
Copy link

That would be awesome. Haven't worked with this yet. but from the docs i gather you don't really need specific roles, but rather "Just" need the AAD object user to be available / assigned in the unity catalog.

This is correct. If you have an AAD bearer token with scope set to the azure databricks application you can do everything a databricks personal access token can do. The scope being -> 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d

Box::new(ClientSecretOAuthProvider::new(
client_id,
client_secret,
authority_id,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using Azure service principal account works (specifying Azure tenant ID as authority_id and account appId with password as client_id and client_secret respectively).

@nohajc
Copy link
Contributor

nohajc commented May 13, 2023

That would be awesome. Haven't worked with this yet. but from the docs i gather you don't really need specific roles, but rather "Just" need the AAD object user to be available / assigned in the unity catalog.

Successfully tested service principal (client ID / client secret) and Azure CLI credentials. I don't have an opportunity to try with managed identity or workload identity right now...

@roeap
Copy link
Collaborator Author

roeap commented May 13, 2023

Thanks for testing this @nohajc!

I don't have an opportunity to try with managed identity or workload identity right now...

I guess we know now, that we are using the right scopes, the rest is just how we get the AAD token, that we have validated elsewhere, so we should be good 🤞.

@nohajc
Copy link
Contributor

nohajc commented May 13, 2023

(I still need to use #[serde(default)] on the sql_path response field.)

@roeap
Copy link
Collaborator Author

roeap commented May 13, 2023

(I still need to use #[serde(default)] on the sql_path response field.)

on it :) - applying some fixes and adding datafusion tests.

@github-actions github-actions bot removed the documentation Improvements or additions to documentation label Jun 2, 2023
@ion-elgreco
Copy link
Collaborator

@roeap Is this ready to get merged?

@rtyler rtyler marked this pull request as ready for review September 19, 2023 06:46
@rtyler rtyler added this to the Rust v0.16 milestone Sep 19, 2023
@rtyler rtyler enabled auto-merge (rebase) September 19, 2023 15:19
auto-merge was automatically disabled September 19, 2023 15:34

Rebase failed

@rtyler rtyler enabled auto-merge (rebase) September 19, 2023 20:33
auto-merge was automatically disabled September 19, 2023 20:51

Rebase failed

@rtyler rtyler enabled auto-merge (rebase) September 19, 2023 23:33
auto-merge was automatically disabled September 19, 2023 23:48

Rebase failed

@rtyler rtyler merged commit 6746dd4 into delta-io:main Sep 20, 2023
@roeap roeap deleted the unity-catalog branch May 26, 2024 17:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/python Issues for the Python package binding/rust Issues for the Rust crate rust
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants