-
Notifications
You must be signed in to change notification settings - Fork 421
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: integrate unity catalog with datafusion #1338
Conversation
Sure, I will gladly test it. |
pub const AZURE_PUBLIC_CLOUD: &str = "https://login.microsoftonline.com"; | ||
} | ||
|
||
/// Trait for providing authrization tokens for catalog requests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo in authrization
rust/src/data_catalog/unity/mod.rs
Outdated
.map(|c| c.as_ref().to_string()) | ||
.unwrap_or("main".into()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should be able to do this instead:
.map(|c| c.as_ref().to_string()) | |
.unwrap_or("main".into()), | |
.as_ref() | |
.map(|c| c.as_ref()) | |
.unwrap_or("main"), |
Success(Table), | ||
/// Error response | ||
Error(ErrorResponse), | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested it and unfortunately , so right now it fails to parse (the same for Error(ErrorResponse)
is not equivalent to Error { error_code: String, message: String }
Success(Table)
.
So far I haven't found any other way other than defining the structs inline, like it was before.
I thought the following might work but it didn't.
#[derive(Deserialize)]
#[serde(untagged)]
pub enum GetTableResponse {
/// Successful response
Success {
/// Table information
#[serde(flatten)]
table: Table,
},
/// Error response
Error {
/// Error information
#[serde(flatten)]
err: ErrorResponse,
},
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm .. if we have to we'll do it, but this makes the types a bit annoying to work with.
The example responses from the docs parse fine in the tests though. would it be possible to provide a response body that doe snot parse? Admittedly, the error case I just inferred from your code :).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, now I'm even more confused. From the test, it looks like it should work. I don't know what serde is doing but it didn't parse the real response before I changed the definition back.
I'll try to write a test case to reproduce the problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, the error cause was different than I thought (see the comment in struct Table
). At least the definitions don't have to be inlined.
} | ||
"access_token" | "unity_access_token" | "databricks_access_token" => { | ||
Ok(UnityCatalogConfigKey::AccessToken) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like this should also match AZURE_TENANT_ID, AZURE_CLIENT_ID and AZURE_CLIENT_SECRET, right?
Btw, I wouldn't use the unity_
alias for Azure tenant and client because it's not Databrick/Unity specific.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have been going back and forth on this and hope to experiment a bit here.
The main nuisance I encountered was that in scenarios where you are juggling many connections sometime we pick up things from the environment that aren't intended.
Generally speaking though yes, these are quite common and users would likely expect this to be picked up.
pub updated_by: String, | ||
|
||
/// List of schemes whose objects can be referenced without qualification. | ||
pub sql_path: String, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the culprit. My Unity Catalog returns response without any sql_path
.
pub sql_path: String, | |
#[serde(default)] | |
pub sql_path: String, |
I fixed it like this but looking at the documentation, it seems like most of the response fields could be optional (even though it doesn't make much sense in some cases).
"created_by": "string", | ||
"name": "table_name", | ||
"updated_by": "string", | ||
"sql_path": "string", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could add a test case with omitted sql_path
but who knows which other fields can be missing from the response under different circumstances...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we can just make a bunch of these fields optional. the ones we actually need are quite limited and should always be available - like name etc..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WE now just derive default for Schema
and Table
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rechecked with the latest changes and deriving Default for the struct does not change the deserializer behavior.
The #[serde(default)]
macro still needs to be used either per field or on the whole struct making all the fields optional with empty default. An alternative would be to use Option<String>
where it makes sense.
@roeap So far I only verified it works with Databricks personal access token but if I figure out which AAD roles are needed, I could check the OAuth flows (+ azure cli) as well. |
That would be awesome. Haven't worked with this yet. but from the docs i gather you don't really need specific roles, but rather "Just" need the AAD object user to be available / assigned in the unity catalog. |
This is correct. If you have an AAD bearer token with scope set to the azure databricks application you can do everything a databricks personal access token can do. The scope being -> 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d |
Box::new(ClientSecretOAuthProvider::new( | ||
client_id, | ||
client_secret, | ||
authority_id, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using Azure service principal account works (specifying Azure tenant ID as authority_id
and account appId with password as client_id
and client_secret
respectively).
Successfully tested service principal (client ID / client secret) and Azure CLI credentials. I don't have an opportunity to try with managed identity or workload identity right now... |
Thanks for testing this @nohajc!
I guess we know now, that we are using the right scopes, the rest is just how we get the AAD token, that we have validated elsewhere, so we should be good 🤞. |
(I still need to use |
on it :) - applying some fixes and adding datafusion tests. |
Co-authored-by: nohajc <[email protected]>
@roeap Is this ready to get merged? |
Rebase failed
Rebase failed
Description
This is a follow-up from #1331, where we added support for the unity catalog and does a couple of things:
SchemaProvider
andCatalogProvider
At first glance porting the whole client implementation may seem excessive, but my hope is that with this we can eventually replace the rusoto dependency from the glue catalog, as well as implement the
LogStore
locking mechanisms suggested by @nkarpov while keeping everything SDK-free :).@nohajc - unfortunately I do not have access to a unity catalog instance right now. Would it be possible for you to validate, that the requests are still working?
Related Issue(s)
No issues