Implement hf:// / "hugging face" integration in datafusion-cli #10720

Open
alamb opened this issue May 30, 2024 · 8 comments · May be fixed by #10792
Labels
enhancement New feature or request

Comments

@alamb
Contributor

alamb commented May 30, 2024

Is your feature request related to a problem or challenge?

The DuckDB blog shows off a really cool new feature (accessing remote datasets from Hugging Face):

https://duckdb.org/2024/05/29/access-150k-plus-datasets-from-hugging-face-with-duckdb

I think doing this with DataFusion would be quite cool and quite simple to implement. Being able to add such support quickly would be a good example of how DataFusion's extensibility allows rapid feature development, as well as being a cool project.

Describe the solution you'd like

I would like to support this type of query from datafusion-cli:

SELECT *
FROM 'hf://datasets/datasets-examples/doc-formats-csv-1/data.csv';

Describe alternatives you've considered

I think we can just follow the same model as the existing object store integration in datafusion-cli:

pub(crate) fn register_options(ctx: &SessionContext, scheme: &str) {
    // Match the provided scheme against supported cloud storage schemes:
    match scheme {
        // For Amazon S3 or Alibaba Cloud OSS
        "s3" | "oss" | "cos" => {
            // Register AWS specific table options in the session context:
            ctx.register_table_options_extension(AwsOptions::default())
        }
        // For Google Cloud Storage
        "gs" | "gcs" => {
            // Register GCP specific table options in the session context:
            ctx.register_table_options_extension(GcpOptions::default())
        }
        // For unsupported schemes, do nothing:
        _ => {}
    }
}

pub(crate) async fn get_object_store(
    state: &SessionState,
    scheme: &str,
    url: &Url,
    table_options: &TableOptions,
) -> Result<Arc<dyn ObjectStore>, DataFusionError> {
    let store: Arc<dyn ObjectStore> = match scheme {
        "s3" => {
            let Some(options) = table_options.extensions.get::<AwsOptions>() else {
                return exec_err!(
                    "Given table options incompatible with the 's3' scheme"
                );
            };
            let builder = get_s3_object_store_builder(url, options).await?;
            Arc::new(builder.build()?)
        }
        "oss" => {
            let Some(options) = table_options.extensions.get::<AwsOptions>() else {
                return exec_err!(
                    "Given table options incompatible with the 'oss' scheme"
                );
            };
            let builder = get_oss_object_store_builder(url, options)?;
            Arc::new(builder.build()?)
        }
        "cos" => {
            let Some(options) = table_options.extensions.get::<AwsOptions>() else {
                return exec_err!(
                    "Given table options incompatible with the 'cos' scheme"
                );
            };
            let builder = get_cos_object_store_builder(url, options)?;
            Arc::new(builder.build()?)
        }
        "gs" | "gcs" => {
            let Some(options) = table_options.extensions.get::<GcpOptions>() else {
                return exec_err!(
                    "Given table options incompatible with the 'gs'/'gcs' scheme"
                );
            };
            let builder = get_gcs_object_store_builder(url, options)?;
            Arc::new(builder.build()?)
        }
        "http" | "https" => Arc::new(
            HttpBuilder::new()
                .with_url(url.origin().ascii_serialization())
                .build()?,
        ),
        _ => {
            // For other types, try to get from `object_store_registry`:
            state
                .runtime_env()
                .object_store_registry
                .get_store(url)
                .map_err(|_| {
                    exec_datafusion_err!("Unsupported object store scheme: {}", scheme)
                })?
        }
    };
    Ok(store)
}

and register the hf:// URL with a specially created HTTP object store instance.
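To make the URL handling concrete, here is a minimal sketch of how an hf:// location might be rewritten to a plain HTTPS URL before it is handed to an HTTP object store. The helper name `hf_to_https` is hypothetical, and the `hf://datasets/<owner>/<repo>/<path>` layout, the `huggingface.co/.../resolve/<revision>/...` target, and the default `main` revision are assumptions that this issue does not specify. It uses only the standard library and does not touch any DataFusion or object_store API.

```rust
/// Hypothetical helper: rewrite an `hf://` dataset URL to an HTTPS URL.
///
/// Assumes the layout `hf://datasets/<owner>/<repo>/<path>` and the
/// `main` revision; neither is specified in this issue.
fn hf_to_https(url: &str) -> Option<String> {
    // Only handle dataset URLs; anything else is left to other schemes.
    let rest = url.strip_prefix("hf://datasets/")?;

    // Split into owner, repo, and the file path inside the repo.
    let mut parts = rest.splitn(3, '/');
    let owner = parts.next().filter(|s| !s.is_empty())?;
    let repo = parts.next().filter(|s| !s.is_empty())?;
    let path = parts.next().filter(|s| !s.is_empty())?;

    Some(format!(
        "https://huggingface.co/datasets/{owner}/{repo}/resolve/main/{path}"
    ))
}
```

Under these assumptions, a new "hf" arm in the scheme match could call a helper like this and then build the store the same way the existing "http" | "https" arm does.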

Additional context

No response

@alamb alamb added the enhancement New feature or request label May 30, 2024
@xinlifoobar
Contributor

xinlifoobar commented May 30, 2024

I am pretty interested in this idea. I saw that DuckDB implements Hugging Face authentication via

 CREATE SECRET hf_token (
    TYPE HUGGINGFACE,
    TOKEN 'your_hf_token'
 );

or

 CREATE SECRET hf_token (
    TYPE HUGGINGFACE,
    PROVIDER CREDENTIAL_CHAIN
 );

Is there an equivalent API in datafusion?

@alamb
Contributor Author

alamb commented May 30, 2024

Is there an equivalent API in datafusion?

The equivalent can be specified as part of each external table definition. For example, see https://datafusion.apache.org/user-guide/cli/datasources.html#s3:

CREATE EXTERNAL TABLE test
STORED AS PARQUET
OPTIONS(
    'aws.access_key_id' '******',
    'aws.secret_access_key' '******',
    'aws.region' 'us-east-2'
)
LOCATION 's3://bucket/path/file.parquet';

This isn't quite as good as a secret that can be reused, but it should work.
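For Hugging Face, a per-table option could carry the token in the same way. The sketch below shows one way such an option might be turned into an HTTP `Authorization` header value. The option key `hf.token` and the function `hf_auth_header` are hypothetical (DataFusion does not define them), and the bearer-token scheme is an assumption about how the Hugging Face endpoint expects credentials.

```rust
use std::collections::HashMap;

/// Hypothetical option key; DataFusion does not define `hf.token`.
const HF_TOKEN_KEY: &str = "hf.token";

/// Build an `Authorization` header value from table options.
/// Returns `None` when no token (or only whitespace) was supplied,
/// so public datasets can still be read anonymously.
fn hf_auth_header(options: &HashMap<String, String>) -> Option<String> {
    options
        .get(HF_TOKEN_KEY)
        .map(|t| t.trim())
        .filter(|t| !t.is_empty())
        // Assumption: the endpoint accepts a bearer token.
        .map(|t| format!("Bearer {t}"))
}
```

The resulting string would then need to be attached to every request made by the HTTP object store, which is outside the scope of this sketch.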

@xinlifoobar
Contributor

xinlifoobar commented Jun 4, 2024

take

@xinlifoobar
Contributor

xinlifoobar commented Jun 5, 2024

Hey @alamb, I need some help implementing the wildcard functions. Does DataFusion support selecting from a wildcard of external files, e.g., select * from 's3://some-bucket/test/*.parquet'? I read through the docs but failed to find a proper example.

@alamb
Contributor Author

alamb commented Jun 6, 2024

Hey @alamb, I need some help implementing the wildcard functions. Does DataFusion support selecting from a wildcard of external files, e.g., select * from 's3://some-bucket/test/*.parquet'? I read through the docs but failed to find a proper example.

I don't think it supports wildcards. Instead, you can use a listing table to read all the files in a directory (that have the correct extension):

https://datafusion.apache.org/user-guide/sql/ddl.html#create-external-table

CREATE EXTERNAL TABLE test
STORED AS CSV
LOCATION '/path/to/directory/of/files'
OPTIONS ('has_header' 'true');
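As a rough illustration of the "directory plus extension" behaviour described above, the sketch below filters a list of object paths down to those under a directory prefix with a matching extension. The function `files_with_extension` is hypothetical and is not DataFusion's actual listing-table code. It also assumes that files in nested subdirectories should be included.

```rust
/// Keep only the object paths under `dir` whose extension matches `ext`.
/// Assumption: nested subdirectories under `dir` are included.
fn files_with_extension<'a>(paths: &[&'a str], dir: &str, ext: &str) -> Vec<&'a str> {
    // Normalise so that "data" and "data/" behave the same, and so that
    // a sibling such as "database/" is not matched by the prefix "data".
    let prefix = format!("{}/", dir.trim_end_matches('/'));
    // Accept both "csv" and ".csv".
    let suffix = format!(".{}", ext.trim_start_matches('.'));

    paths
        .iter()
        .copied()
        .filter(|p| p.starts_with(&prefix) && p.ends_with(&suffix))
        .collect()
}
```

A wildcard such as `*.parquet` in the last path segment could be approximated with the same kind of prefix and suffix check, though full glob support would need more than this.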

@findepi
Member

findepi commented Aug 23, 2024

#11979 is probably related

@alamb
Contributor Author

alamb commented Sep 24, 2024

Per the discussion above, I think the idea is that we'll move this kind of feature to other repos / crates.

Sorry for the noise


3 participants