hugging Face Datasets Plugin #1116

Merged
merged 5 commits into flyteorg:master from the huggingface-datasets branch on Sep 16, 2022

Conversation

esadler-hbo
Contributor

@esadler-hbo esadler-hbo commented Jul 30, 2022

TL;DR

Hugging Face provides great packages to make working with state-of-the-art language models easy. Integrating with Flyte would connect ETL to the training and inference of deep learning models seamlessly.

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

You can use Hugging Face to create high-quality embeddings, which are becoming really valuable to a lot of companies. Flyte could elegantly handle the different infrastructure considerations. Notice that this use case involves no model training, which makes the workflow especially attractive.

The first integration adds Hugging Face's datasets to Flyte's StructuredDataset type. The datasets library is a very performant way to pass data into neural networks. It is based on tf.data.Dataset, but uses Arrow instead of TFRecords. I am excited by the idea of having an ETL job output a pyspark.sql.DataFrame and then doing batch training and batch inference with a Hugging Face dataset seamlessly.

The second integration would be coming up with a highly scalable task for step 2 in the following workflow:

  1. ETL: prepare dataset of text
  2. Inference: run data through a Hugging Face model pipeline
  3. Upload: Push results to a database that can handle vectors, like Pinecone
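
The three steps above can be sketched in plain Python to show how data moves between them. This is only an outline under stated assumptions: embed_batch is a placeholder that derives numbers from a hash and is not a Hugging Face pipeline, InMemoryVectorStore is a hypothetical stand-in for a vector database client such as Pinecone, and none of the function names come from the PR.

```python
import hashlib
from typing import Dict, Iterator, List


def prepare_texts() -> List[str]:
    # Step 1 (ETL): hypothetical stand-in that cleans raw text and drops empty rows.
    raw = ["  Flyte runs workflows ", "", "Hugging Face hosts models"]
    return [t.strip() for t in raw if t.strip()]


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    # Yield consecutive chunks of at most `size` items.
    for start in range(0, len(items), size):
        yield items[start : start + size]


def embed_batch(texts: List[str], dim: int = 4) -> List[List[float]]:
    # Step 2 (Inference): placeholder for a model pipeline call.
    # Produces deterministic pseudo-embeddings from a hash; NOT a real model.
    vectors = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        vectors.append([byte / 255.0 for byte in digest[:dim]])
    return vectors


class InMemoryVectorStore:
    # Step 3 (Upload): hypothetical stand-in for a vector database client.
    def __init__(self) -> None:
        self.rows: Dict[str, List[float]] = {}

    def upsert(self, ids: List[str], vectors: List[List[float]]) -> None:
        for row_id, vector in zip(ids, vectors):
            self.rows[row_id] = vector


def run(batch_size: int = 2) -> InMemoryVectorStore:
    store = InMemoryVectorStore()
    offset = 0
    for batch in batched(prepare_texts(), batch_size):
        vectors = embed_batch(batch)
        ids = [f"doc-{offset + k}" for k in range(len(batch))]
        store.upsert(ids, vectors)
        offset += len(batch)
    return store


store = run()
```

Keeping step 2 as a function over batches is what would let an orchestrator fan it out across workers; the batch size here is arbitrary.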

I have heard from @gdj0nes that this is a common workflow that has infrastructure pain points.

Finally, Hugging Face has a platform where you can save datasets, models, and deploy ML applications. There are opportunities to integrate with their platform that should be mentioned, but are lower priority.

Tracking Issue

https://github.com/flyteorg/flyte/issues/

Follow-up issue

NA
OR
https://github.com/flyteorg/flyte/issues/

@esadler-hbo esadler-hbo changed the title Hugging Face Integration Hugging Face Plugin Jul 30, 2022
@esadler-hbo esadler-hbo marked this pull request as draft July 30, 2022 15:23
@codecov

codecov bot commented Jul 30, 2022

Codecov Report

Merging #1116 (91e4d7d) into master (aff19cb) will increase coverage by 0.12%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #1116      +/-   ##
==========================================
+ Coverage   68.38%   68.51%   +0.12%     
==========================================
  Files         288      288              
  Lines       25963    26095     +132     
  Branches     2899     2920      +21     
==========================================
+ Hits        17756    17880     +124     
- Misses       7728     7736       +8     
  Partials      479      479              
Impacted Files                                         Coverage Δ
flytekit/tools/repo.py                                 73.68% <0.00%> (ø)
flytekit/tools/fast_registration.py                    89.06% <0.00%> (ø)
tests/flytekit/unit/configuration/test_internal.py     100.00% <0.00%> (ø)
...ests/flytekit/unit/tools/test_fast_registration.py  100.00% <0.00%> (ø)
...ctured_dataset/test_structured_dataset_workflow.py  100.00% <0.00%> (ø)
flytekit/core/interface.py                             61.80% <0.00%> (+0.02%) ⬆️
tests/flytekit/unit/core/test_type_engine.py           98.39% <0.00%> (+0.05%) ⬆️
flytekit/clis/sdk_in_container/package.py              96.29% <0.00%> (+0.14%) ⬆️
tests/flytekit/unit/core/test_flyte_pickle.py          91.37% <0.00%> (+0.26%) ⬆️
tests/flytekit/unit/cli/pyflyte/test_run.py            99.20% <0.00%> (+0.32%) ⬆️
... and 5 more

@kumare3
Contributor

kumare3 commented Jul 31, 2022

@esadler-hbo you rock! Love all 3 integration goals.

@esadler-hbo & @samhita-alla would you folks be open to writing a blog?

@wild-endeavor
Contributor

@esadler-hbo let me take a look at this and try to fix some of the handling around protocol

@esadler-hbo
Contributor Author

@wild-endeavor amazing! I’ll get a chance to work on this more this weekend.

@wild-endeavor
Contributor

Thanks! Yeah I really need to get to those changes I was talking about today.

@esadler-hbo esadler-hbo changed the title Hugging Face Plugin Datasets Plugin Aug 17, 2022
@esadler-hbo esadler-hbo changed the title Datasets Plugin hugging Face Datasets Plugin Aug 18, 2022
@wild-endeavor
Contributor

tagged you also on the other PR @easadler-hbo

@esadler-hbo esadler-hbo force-pushed the huggingface-datasets branch from 3dc8344 to e242ff7 Compare August 31, 2022 20:41
Member

@pingsutw pingsutw left a comment

@esadler-hbo Thanks, overall LGTM. Some minor comments below.

@esadler-hbo esadler-hbo marked this pull request as ready for review September 15, 2022 02:15
@wild-endeavor wild-endeavor merged commit 21ae290 into flyteorg:master Sep 16, 2022

5 participants