Add support for Azure Data Lake Storage Gen2 (aka: ADLS Gen2) in Object Store library #3283
Comments
The ObjectStore library does not currently contain logic for parsing URLs; this is tracked by #2304. However, querying ADLS buckets by path should work correctly, e.g. something like the following.
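A minimal sketch of path-based listing, assuming the object_store API of around this time (where `list` is async and returns a stream) and placeholder account, container, and key values:

```rust
use futures::TryStreamExt;
use object_store::azure::MicrosoftAzureBuilder;
use object_store::{path::Path, ObjectStore};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder credentials: swap in a real account, container, and key
    let store = MicrosoftAzureBuilder::new()
        .with_account("myaccount")
        .with_container_name("my-container")
        .with_access_key("<access-key>")
        .build()?;

    // Address a Gen2 "directory" by its path prefix
    let prefix = Path::from("path/to/dir");
    let entries: Vec<_> = store.list(Some(&prefix)).await?.try_collect().await?;
    for meta in entries {
        println!("{} ({} bytes)", meta.location, meta.size);
    }
    Ok(())
}
```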
However, the crate can be and is being used with ADLS Gen2; there is in fact code, such as here, specifically to handle the fact that Gen2 buckets have directories. Perhaps @roeap might be able to weigh in here, as he added the initial support for ADLS Gen2 buckets back when this repo lived elsewhere - influxdata/object_store_rs#25
The object_store crate should support Gen2 accounts as is. AFAIK, Gen2 accounts always also expose the blob APIs, which are "almost" the same; mostly the behavior @tustvold pointed to needs to be handled to be consistent with other stores. To get it to work, just pass in a "blob style" URL pointing at the same account / container as the Gen2 URL. W.r.t. the issue linked above, we have an implementation of a more generic URL parser, mainly intended to resolve most well-known URL formats for storage locations. My hope was to move this upstream behind the object store factory API currently part of DataFusion. @tustvold, do you think the logic in the link above is "close enough" that we could start iterating in a PR here? If so, I'm happy to propose one.
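For illustration of the "blob style" workaround (account and container names are placeholders), the same location addressed both ways:

```
Gen2 ("dfs") endpoint: https://myaccount.dfs.core.windows.net/my-container/path/to/file
Blob endpoint:         https://myaccount.blob.core.windows.net/my-container/path/to/file
```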
That would be awesome, shall I assign #2304 to you then? Perhaps something like the sketch below?
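A hypothetical reconstruction of the kind of builder API being floated here — the trait and method names are assumptions, not the actual proposal from the thread:

```rust
use object_store::{ObjectStore, Result};

// Hypothetical: a common trait the per-store builders could implement
trait ObjectStoreBuilder: Default {
    /// Populate builder options from a URL such as
    /// "abfss://container@account.dfs.core.windows.net/path"
    fn with_url(self, url: &str) -> Self;

    /// Construct the configured store
    fn build(self) -> Result<Box<dyn ObjectStore>>;
}
```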
Sure, go ahead and assign :). The API above looks great; maybe a first step could be to implement that on all the builders. An optional second step could be to move the DataFusion object store factory trait into object_store as well and introduce a builder trait with the two methods from above. Then a factory that maps the URL to a specific builder from a map of schemes to implementations should be straightforward to implement?
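A hypothetical sketch of such a factory, building on the hypothetical trait above; none of these names come from an actual API:

```rust
use std::collections::HashMap;
use object_store::{ObjectStore, Result};

// Hypothetical: each registered scheme maps to a function that builds a
// store from the full URL
type BuildFn = fn(&str) -> Result<Box<dyn ObjectStore>>;

struct ObjectStoreFactory {
    schemes: HashMap<&'static str, BuildFn>,
}

impl ObjectStoreFactory {
    /// Look up the builder registered for the URL's scheme and run it
    fn get_store(&self, url: &str) -> Option<Result<Box<dyn ObjectStore>>> {
        let (scheme, _) = url.split_once("://")?;
        self.schemes.get(scheme).map(|build| build(url))
    }
}
```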
@roeap I tried using the code in Delta Lake for accessing Azure Data Lake (the one that you have given above). It all ends up trying to access data in ADLS Gen2 with the wrong domain. In the current implementation, an ADLS Gen2 URL on the `.dfs.core.windows.net` domain is transformed into one on the `.blob.core.windows.net` domain, which I don't think is correct for the ADLS Gen2 case. It should stay on the `.dfs.core.windows.net` domain,
as defined here: https://learn.microsoft.com/en-us/rest/api/storageservices/datalakestoragegen2/path/list @tustvold I also tried listing the files with the method you provided. The same result: it goes to the wrong domain. After looking into the …
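For reference, the two list operations live on different endpoints (generic forms from the Microsoft docs; names in braces are placeholders):

```
Blob (List Blobs):        GET https://{account}.blob.core.windows.net/{container}?restype=container&comp=list
ADLS Gen2 (Path - List):  GET https://{account}.dfs.core.windows.net/{filesystem}?resource=filesystem&recursive={recursive}
```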
My knowledge of Azure is fairly limited, but I had thought that ADLS Gen2 buckets still exposed the traditional blob APIs, and that the new APIs were strictly opt-in. I think that is what this is saying, but I find the Azure documentation pretty hard to follow. FWIW …
@tustvold: HDFS with the ADLS Gen2 API is different from HDFS + Azure Blob. Here is listing files from ADLS Gen2 with HDFS: https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-use-hdfs-data-lake-storage#get-a-list-of-files-or-directories (see the path used in there). I'm not interested in the additional functionality that ADLS Gen2 offers. What I want is to be able to use the `object_store` crate with ADLS Gen2. I'm currently able to use …
The link above would suggest that you can address data in ADLS Gen2 using the backwards-compatible blob storage APIs, which this crate supports? What am I missing here? In particular …
Is it possible Spark makes a distinction because this wasn't always the case, perhaps? Edit: looking back at the error message you are receiving …
@andrei-ionescu, I think @tustvold is correct, in that Gen2 accounts will always also expose the blob APIs under the `.blob.core.windows.net` endpoint. The error message we see suggests that whatever credential is in use likely does not possess list permission on the account or path; these have to be granted in addition to regular read permissions. Recently I stumbled across the same issue :)
@roeap, I'm not sure that what you're saying is correct in all cases. Even Microsoft has built separate SDKs for blob and ADLS Gen2. See here: https://github.com/Azure/azure-sdk-for-rust/tree/main/sdk If you stumbled across my issue, please let me know how you fixed it. What workaround did you use? Please provide an example on top of ADLS Gen2. Listing a folder's content with the current implementation of `object_store` does not work for me. In the meantime I'll try to add an Azure Data Lake Storage Gen2 implementation into `object_store`.
Let me double check that ADLS Gen2 still works correctly; I have tested it in the past and it worked, as has @roeap. Assuming it does, I would rather not add an additional implementation that just changes to use equivalent but differently named APIs. That's just code duplication for no tangible benefit.
Because there are features of blob storage not supported by ADLS Gen2 and vice versa; however, nothing that object_store uses, or is likely to ever use, falls outside the functionality common to both - see here
For me it was just about permissions. I had assigned read-all rights to a service principal, but needed to assign list permissions separately. The trace above suggests it's list that is failing. Are you using the same credential that was used to mount the storage with Spark? The Rust SDK is actually the reason I am so sure about this. A while ago I made some significant contributions to the Azure datalake crate, and since we wanted to avoid pulling in blob as a dependency, we deliberately decided to diverge from the way things are done in other language SDKs. That being said, I recently tried it successfully while debugging SAS key auth in delta, but am happy to try again. The code below shows where the Python SDK for Gen2 initializes a blob client internally …
In a storage account with hierarchical namespace enabled (ADLS Gen2), and a container called "bucket", I copied the key from …, ran the tests with …, with the environment …, and they passed successfully; we can see the directories created by the tests.
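A sketch of the kind of environment-driven setup the integration tests use — the variable names here are assumptions modelled on the crate's test setup, not copied from it:

```rust
use std::env;
use object_store::azure::MicrosoftAzureBuilder;
use object_store::ObjectStore;

// Variable names are assumptions; check the test setup in
// object_store/src/azure/mod.rs for the authoritative ones.
// Run with something like: TEST_INTEGRATION=1 cargo test -p object_store --features azure
fn azure_store_from_env() -> object_store::Result<impl ObjectStore> {
    MicrosoftAzureBuilder::new()
        .with_account(env::var("AZURE_STORAGE_ACCOUNT").unwrap())
        .with_access_key(env::var("AZURE_STORAGE_ACCESS_KEY").unwrap())
        .with_container_name(env::var("OBJECT_STORE_BUCKET").unwrap())
        .build()
}
```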
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Azure Data Lake Storage Gen2 has been available for quite a while. Here is the Microsoft Azure Data Lake Storage Gen2 REST API documentation.
I'm trying to access Azure ADLS Gen2 using `object_store` to read files. Then, as a next step, I want to use DataFusion to read data from ADLS Gen2, process that data, and write it back to ADLS Gen2.
Looking at these lines of code — object_store/src/azure/mod.rs#L549-L574 — we can see that the ADLS Gen2 URL format below (generic form; the concrete example is not preserved) is not supported:
`https://<account>.dfs.core.windows.net/<container>/<path>`
Describe the solution you'd like
Have support for Azure Data Lake Storage Gen2 in the `object_store` library.
Describe alternatives you've considered
I did try to look into the Azure SDK for Data Lake but it does NOT seem to fit well with the `object_store` library that is used in DataFusion.
My end goal is to read parquet data from ADLS Gen2 with DataFusion.
Additional context
Here is an error that looks like it doesn't find the file, and this is because it looks in the wrong place: instead of looking at the `.dfs.core.windows.net` domain, it looks at `.blob.core.windows.net`.