Add support for Azure Data Lake Storage Gen2 (aka: ADLS Gen2) in Object Store library #3283

Closed
andrei-ionescu opened this issue Dec 6, 2022 · 15 comments
Labels
enhancement Any new improvement worthy of an entry in the changelog

Comments

@andrei-ionescu

andrei-ionescu commented Dec 6, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Azure Data Lake Storage Gen2 has been available for quite a while. Here is the Microsoft Azure Data Lake Storage Gen2 REST API documentation.

I'm trying to access Azure ADLS Gen2 using object_store to read files. As a next step, I want to use DataFusion to read data from ADLS Gen2, process that data, and then write it back to ADLS Gen2.

Looking at these lines of code — object_store/src/azure/mod.rs#L549-L574 — we can see that the ADLS Gen2 URL format below is not supported:

abfss://my_file_system@az_account_name.dfs.core.windows.net/my/path/id1/id2/

Describe the solution you'd like

Have support for Azure Data Lake Storage Gen2 in object_store library.

Describe alternatives you've considered

I did try to look into the Azure SDK for Data Lake but it does NOT seem to fit well with the object_store library that is used in DataFusion.

My end goal is to read Parquet data from ADLS Gen2 with DataFusion.

Additional context

Here is an error that looks like the file isn't found, and this is because the request goes to the wrong place: instead of the .dfs.core.windows.net domain it targets .blob.core.windows.net.

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Generic { store: "MicrosoftAzure", 
source: ListRequest { source: Error { retries: 0, message: "\u{feff}<?xml version=\"1.0\" encoding=\"utf-8\"?>
<Error><Code>AuthorizationPermissionMismatch</Code><Message>This request is not authorized to perform this
operation using this permission.\nRequestId:013baf5b-501e-003a-3a90-095d29000000\n
Time:2022-12-06T16:37:11.0059157Z</Message></Error>", source: Some(reqwest::Error { kind: Status(403), 
url: Url { scheme: "https", cannot_be_a_base: false, username: "", password: None, 
host: Some(Domain("az_account_name.blob.core.windows.net")), port: None, 
path: "/my_file_system", query: Some("restype=container&comp=list&prefix=my%2Fpath%2Fid1%2Fid2%2F"), 
fragment: None } }) } } }'
@tustvold
Contributor

tustvold commented Dec 7, 2022

The ObjectStore library does not currently contain logic for parsing URLs, this is tracked by #2304.

However, querying ADLS buckets by path should work correctly. e.g. something like

use futures::TryStreamExt;
use object_store::{azure::MicrosoftAzureBuilder, path::Path, ObjectStore};

let store = MicrosoftAzureBuilder::new()
    .with_account("az_account_name")
    .with_container_name("my_file_system")
    .with_<CREDENTIALS>(...) // placeholder: substitute whichever credential method you use, e.g. an access key
    .build()?;

let files: Vec<_> = store.list(Some(&Path::from("my/path/id1/id2/"))).await?.try_collect().await?;

However, the crate can be and is being used with ADLS Gen2; there is in fact code such as here specifically to handle the fact that Gen2 buckets have directories.

Perhaps @roeap might be able to weigh in here, as he added the initial support for ADLS Gen2 buckets back when this repo lived elsewhere - influxdata/object_store_rs#25

@roeap
Contributor

roeap commented Dec 7, 2022

The object_store crate should support Gen2 accounts as is. AFAIK, Gen2 accounts always also expose the blob APIs, which are "almost" the same; mostly, the behavior @tustvold pointed to needs to be handled to stay consistent with other stores. To get it to work, just pass in a "blob style" URL pointing at the same account / container as the Gen2 URL.

W.r.t. the issue linked above, we have an implementation of a more generic URL parser, mainly intended to resolve most well-known URL formats for storage locations.

My hope was to move this upstream, behind the object store factory API currently part of DataFusion. @tustvold, do you think the logic in the link above is "close enough" that we could start iterating in a PR here? If so, I'm happy to propose one.

@tustvold
Contributor

tustvold commented Dec 7, 2022

do you think the logic in the link above is "close enough" that we could start iterating in a PR here

That would be awesome, shall I assign #2304 to you then? Perhaps something like?

AmazonS3Builder::from_env().with_url("s3://foo/").build().unwrap();

MicrosoftAzureBuilder::from_env().with_url("abfss://my_file_system@az_account_name.dfs.core.windows.net/my/path/id1/id2/").build().unwrap()

@roeap
Contributor

roeap commented Dec 7, 2022

Sure, go ahead and assign :).

The API above looks great; maybe a first step could be to implement that on all the builders. An optional second step could be to move the DataFusion object store factory trait into object_store as well and introduce a builder trait with the two methods from above. Then a factory that maps the URL to a specific builder, from a map of schemes to implementations, should be straightforward to implement?
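
For illustration only, here is a rough sketch of that idea; the names below (ObjectStoreBuilder, ObjectStoreFactory) are hypothetical and are not part of object_store, they simply show a two-method builder trait plus a scheme-to-builder map:

// Hypothetical sketch, not object_store API.
use std::collections::HashMap;
use std::sync::Arc;

use object_store::ObjectStore;

type BoxError = Box<dyn std::error::Error + Send + Sync>;

trait ObjectStoreBuilder: Sized {
    /// Populate the builder from environment variables.
    fn from_env() -> Self;
    /// Refine the builder from a storage-location URL such as
    /// "s3://bucket/" or "abfss://fs@account.dfs.core.windows.net/path/".
    fn with_url(self, url: &str) -> Self;
    /// Construct the configured store.
    fn build(self) -> Result<Arc<dyn ObjectStore>, BoxError>;
}

/// Maps a URL scheme ("s3", "az", "abfss", ...) to a constructor function.
#[derive(Default)]
struct ObjectStoreFactory {
    constructors: HashMap<&'static str, fn(&str) -> Result<Arc<dyn ObjectStore>, BoxError>>,
}

impl ObjectStoreFactory {
    fn register(&mut self, scheme: &'static str, make: fn(&str) -> Result<Arc<dyn ObjectStore>, BoxError>) {
        self.constructors.insert(scheme, make);
    }

    fn make_store(&self, url: &url::Url) -> Result<Arc<dyn ObjectStore>, BoxError> {
        let make = self
            .constructors
            .get(url.scheme())
            .ok_or_else(|| format!("no object store registered for scheme {}", url.scheme()))?;
        make(url.as_str())
    }
}

A concrete registration would then map "s3" to the S3 builder, "abfss" and "az" to the Azure builder, and so on.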

@andrei-ionescu
Author

andrei-ionescu commented Dec 7, 2022

@roeap I tried using the code in Delta Lake for accessing Azure Data Lake (the one that you have given above). It all ends up trying to access data in ADLS Gen2 with the wrong domain — https://{}.blob.core.windows.net — and the wrong API. For ADLS the domain should be dfs.core.windows.net.

In the current implementation, an ADLS Gen2 URL:

abfss://my_file_system@az_account_name.dfs.core.windows.net/my/path/id1/id2/

Is transformed into:

https://az_account_name.blob.core.windows.net/my_file_system/my/path/id1/id2/

Which I don't think is correct for the ADLS Gen2 case. It should be:

https://az_account_name.dfs.core.windows.net/my_file_system?directory=my/path/id1/id2/&resource=filesystem

as defined here: https://learn.microsoft.com/en-us/rest/api/storageservices/datalakestoragegen2/path/list

@tustvold I also tried listing the files with the method you provided. The same result: it goes to the wrong domain.

After looking into the object_store Azure implementation I think it will require a new AzureClient that uses the ADLS Gen2 APIs described by Microsoft here: https://learn.microsoft.com/en-us/rest/api/storageservices/data-lake-storage-gen2

@tustvold
Contributor

tustvold commented Dec 7, 2022

trying to access data in ADLS Gen2 with the wrong domain — https://{}.blob.core.windows.net — and the wrong API
It goes to the wrong domain

My knowledge of Azure is fairly limited, but I had thought that ADLS Gen2 buckets still exposed the traditional blob APIs, and the new APIs were strictly opt-in. I think that is what this is saying, but I find the Azure documentation to be pretty hard to follow.

FWIW, object_store does not currently expose APIs that would benefit from any of the additional functionality added in ADLS Gen2; it intentionally does not support directories as a design decision. Is there a particular reason you wish to use the ADLS Gen2 API instead of the normal blob APIs?

@andrei-ionescu
Author

andrei-ionescu commented Dec 7, 2022

@tustvold: HDFS with the ADLS Gen2 API is different from HDFS + Azure Blob. Here is how to list files from ADLS Gen2 with HDFS: https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-use-hdfs-data-lake-storage#get-a-list-of-files-or-directories (see the path used there).

I'm not interested in the additional functionality that ADLS Gen2 offers.

What I want is to be able to use object_store to access data in ADLS Gen2 with Arrow DataFusion. I know that DataFusion uses object_store to read from S3 and Azure Blob storage, but my data is in ADLS Gen2.

I'm currently able to use abfss://my_file_system@az_account_name.dfs.core.windows.net/my/path/id1/id2/ paths with Apache Spark to read and write, but I would like to have the same functionality in DataFusion.

@tustvold
Contributor

tustvold commented Dec 7, 2022

The link above would suggest that you can address data in ADLS Gen2 using the backwards-compatible blob storage APIs, which this crate supports? What am I missing here?

In particular

Blob APIs and Data Lake Storage Gen2 APIs can operate on the same data in storage accounts that have a hierarchical namespace.

Existing tools and applications that use the Blob API gain these benefits automatically. Developers won't have to modify them.

Is it possible Spark makes a distinction because this wasn't always the case?

Edit: looking back at the error message you are receiving, AuthorizationPermissionMismatch, perhaps the credentials you are using lack the requisite permissions to access blob storage?

@roeap
Contributor

roeap commented Dec 7, 2022

@andrei-ionescu, I think @tustvold is correct, in that Gen2 accounts will always also expose the blob APIs under the blob endpoint. In fact, most Gen2 API clients (I know this is the case for Python and Java) internally make use of these APIs for some basic operations. Converting the URL is thus deliberate - object_store looks at a URL more in terms of where the data is stored, but makes no promises about the specific APIs used. Since we have to be consistent between different backend implementations, we cannot easily leverage Gen2-specific features, so we opted for a single client to support both services.
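
As an aside, here is a minimal standalone sketch (not object_store's actual code) of that conversion, assuming the url crate; it pulls out the account, container, and path that are needed to address the same data over the blob endpoint:

use url::Url;

// abfss://my_file_system@az_account_name.dfs.core.windows.net/my/path/id1/id2/
fn abfss_parts(url: &str) -> Option<(String, String, String)> {
    let url = Url::parse(url).ok()?;
    let container = url.username().to_string();                   // "my_file_system"
    let account = url.host_str()?.split('.').next()?.to_string(); // "az_account_name"
    let path = url.path().trim_start_matches('/').to_string();    // "my/path/id1/id2/"
    Some((account, container, path))
}

// Requests for that account/container are then issued against
// https://az_account_name.blob.core.windows.net/my_file_system/...,
// which works because Gen2 accounts also serve the blob APIs.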

The error message we see suggests that whatever credential is in use likely does not possess list permission on the account or path; these have to be granted in addition to regular read permissions. Recently I stumbled across the same issue :)

@andrei-ionescu
Author

@roeap, I'm not sure that what you're saying is correct in all cases. Even Microsoft has built separate SDKs for blob and ADLS Gen2. See here: https://github.com/Azure/azure-sdk-for-rust/tree/main/sdk

If you stumbled across my issue, please let me know how you fixed it. What workaround did you use?

Please provide an example on top of ADLS Gen2. Listing a folder's contents with the current implementation of object_store would suffice.

In the meantime I'll try to add an Azure Data Lake Storage Gen2 implementation to object_store.

@tustvold
Contributor

tustvold commented Dec 7, 2022

In the meantime I'll try to add an Azure Data Lake Storage Gen2 implementation to object_store.

Let me double-check that ADLS Gen2 still works correctly; I have tested it in the past and it worked, as has @roeap. Assuming it does, I would rather not add an additional implementation that just switches to equivalent but differently named APIs. That's just code duplication for no tangible benefit.

Even Microsoft has built separate SDKs for blob and ADLS Gen2

Because there are features of blob storage not supported by ADLS Gen2 and vice versa. However, everything object_store uses, or will likely ever use, is part of the shared set - see here

@roeap
Contributor

roeap commented Dec 7, 2022

If you stumbled across my issue, please let me know how you fixed it. What workaround did you use?

For me it was just about permissions. I had assigned "read all" rights to a service principal, but needed to assign list permissions separately. The trace above suggests it is the list call that is failing. Are you using the same credential that was used to mount the storage with Spark?

The Rust SDK is actually the reason I am so sure about this. A while ago I made some significant contributions to the Azure datalake crate, and since we wanted to avoid pulling in blob as a dependency, we deliberately decided to diverge from the way things are done in other language SDKs.

That being said, I recently tried it successfully while debugging SAS key auth in Delta, but am happy to try again.

The code below shows where the Python SDK for Gen2 initializes a blob client internally:

https://github.com/Azure/azure-sdk-for-python/blob/ee8a6b48786d3dd01b60f1648f598cbaad10dff8/sdk/storage/azure-storage-file-datalake/azure/storage/filedatalake/_path_client.py#L91-L92

@tustvold
Contributor

tustvold commented Dec 7, 2022

In a storage account with hierarchical namespace enabled (ADLS Gen2)

[screenshot]

And a container called "bucket"

[screenshot]

I then copied the key from

[screenshot]

And then ran the tests with

cargo test --features azure

With the environment

AZURE_STORAGE_ACCESS_KEY=REDACTED
AZURE_STORAGE_ACCOUNT=tustvold2
TEST_INTEGRATION=1
OBJECT_STORE_BUCKET=bucket

And they passed successfully

And we can see the directories created by the tests

[screenshot]

@andrei-ionescu
Author

andrei-ionescu commented Dec 7, 2022

@roeap, @tustvold: Thank you guys for your help. I'll check my settings on the ADLS Gen2 account and get back to you when I have more info.

In my case I may not be able to change all these settings on the ADLS Gen2 account.

Please don't close this yet.

Thanks.

@andrei-ionescu
Author

@roeap, @tustvold: I managed to get my PoC working. I'm able to read from Azure Data Lake Storage Gen2 using the abfss protocol. Thank you guys for helping out. I'm closing the ticket.
