Implement hf:// / "hugging face" integration in datafusion-cli #10720
Comments
I am quite interested in this idea. I saw that DuckDB just implemented Hugging Face authentication via
CREATE SECRET hf_token (
    TYPE HUGGINGFACE,
    TOKEN 'your_hf_token'
);
or
CREATE SECRET hf_token (
    TYPE HUGGINGFACE,
    PROVIDER CREDENTIAL_CHAIN
);
Is there an equivalent API in datafusion? |
The equivalent can be specified as part of each external table definition. For example (https://datafusion.apache.org/user-guide/cli/datasources.html#s3):
CREATE EXTERNAL TABLE test
STORED AS PARQUET
OPTIONS(
    'aws.access_key_id' '******',
    'aws.secret_access_key' '******',
    'aws.region' 'us-east-2'
)
LOCATION 's3://bucket/path/file.parquet';
This isn't quite as good as a secret that can be reused, but it should work. |
take |
Hey @alamb, I need some help with implementing the wildcard functions. Does datafusion support selecting from a wildcard of external files, e.g., select * from 's3://some-bucket/test/*.parquet'? I read through the docs but failed to find a proper example. |
I don't think it supports wildcards. Instead, the listing table reads all the files in a directory (that have the correct extension): https://datafusion.apache.org/user-guide/sql/ddl.html#create-external-table
CREATE EXTERNAL TABLE test
STORED AS CSV
LOCATION '/path/to/directory/of/files'
OPTIONS ('has_header' 'true'); |
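To make the listing-table suggestion above concrete, here is a minimal sketch of how a simple wildcard location could be reduced to the two things a listing table needs: a directory and a file extension. The helper name `split_wildcard` is hypothetical and is not a DataFusion API. It only handles the plain `*.<ext>` pattern from the question.

```rust
/// Hypothetical helper (not a DataFusion API): split a location such as
/// `s3://some-bucket/test/*.parquet` into the directory to list and the
/// file extension to keep. Returns None for anything other than a
/// trailing `*.<ext>` pattern.
fn split_wildcard(location: &str) -> Option<(&str, &str)> {
    // Split at the last '/', keeping the slash on the directory side.
    let idx = location.rfind('/')?;
    let (dir, file) = location.split_at(idx + 1);

    // Only the simple `*.<ext>` file pattern is handled in this sketch.
    let ext = file.strip_prefix("*.")?;

    // Reject empty extensions and wildcards anywhere else in the location.
    if ext.is_empty() || ext.contains('*') || dir.contains('*') {
        return None;
    }
    Some((dir, ext))
}
```

The resulting directory could then be used as the LOCATION of an external table like the one above, with the extension deciding the STORED AS format. Whether datafusion-cli should do this automatically is not settled in this thread.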
#11979 is probably related |
Per the discussion above, I think the idea is that we'll move this kind of feature to other repos / crates. Sorry for the noise. |
Is your feature request related to a problem or challenge?
The DuckDB blog shows off a really cool new feature (access remote datasets from hugging face):
https://duckdb.org/2024/05/29/access-150k-plus-datasets-from-hugging-face-with-duckdb
I think doing this with DataFusion would be quite cool and quite simple to implement. Being able to add such support quickly would be a good example of how datafusion's extensibility allows rapid feature development as well as being a cool project.
Describe the solution you'd like
I would like to support this type of query from datafusion-cli.
Describe alternatives you've considered
I think we can just follow the same model as the existing object store integration in datafusion-cli (datafusion/datafusion-cli/src/object_storage.rs, lines 419 to 496 in 088ad01) and register the hf url with a specially created Http object store instance.
Additional context
No response
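As a rough illustration of the idea of backing hf urls with an Http object store, the sketch below translates an hf:// location into the HTTPS URL that Hugging Face serves files from. Everything here is an assumption made for illustration: the function name `hf_to_https` is hypothetical, the `hf://datasets/<user>/<repo>[@<revision>]/<path>` layout is modeled on the DuckDB syntax, and only dataset repositories are handled. It is not the code in object_storage.rs.

```rust
/// Hypothetical helper (not part of DataFusion): map an `hf://` location to
/// an HTTPS URL on huggingface.co.
///
/// Assumed input layout: hf://datasets/<user>/<repo>[@<revision>]/<path>
/// Assumed output layout: https://huggingface.co/datasets/<user>/<repo>/resolve/<revision>/<path>
fn hf_to_https(url: &str) -> Option<String> {
    let rest = url.strip_prefix("hf://")?;

    // Split into at most four parts so that <path> may contain further '/'.
    let mut parts = rest.splitn(4, '/');
    let repo_type = parts.next()?;
    let user = parts.next()?;
    let repo_and_rev = parts.next()?;
    let path = parts.next()?;

    // This sketch only handles dataset repositories.
    if repo_type != "datasets" || user.is_empty() || path.is_empty() {
        return None;
    }

    // An optional `@<revision>` suffix on the repo name; default to `main`.
    let (repo, revision) = match repo_and_rev.split_once('@') {
        Some((repo, rev)) => (repo, rev),
        None => (repo_and_rev, "main"),
    };
    if repo.is_empty() || revision.is_empty() {
        return None;
    }

    Some(format!(
        "https://huggingface.co/datasets/{}/{}/resolve/{}/{}",
        user, repo, revision, path
    ))
}
```

With a mapping like this, an hf url could be served by an ordinary Http object store pointed at huggingface.co. Authentication and special revisions are left out of the sketch.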