Introduce distributed embeddings #974
Conversation
The distributed embedding examples use custom train step functions. In my understanding, distributed embedding does NOT work with the Keras model.fit function. I think we need the distributed-embeddings team to review the PR.
Looks good to me.
On model.fit support:
The current code using model.fit should work, since nothing in distributed-embeddings conflicts with Keras model.fit/compile.
The reason we have a custom train_step() example is for when the user wants hybrid data/model parallelism. The way Horovod data parallelism supports the model.fit() API is by wrapping the optimizer, which will break if distributed model parallelism is also used. To my understanding, merlin-models also integrates with Horovod data parallelism through DistributedOptimizer? If so, users could run into problems when they use both integrations.
We will support model.fit + hvd.DistributedOptimizer in the next release, so the code here should just work. One caveat is that the fix (and later versions of DE) will depend on a later Horovod version.
Alternatively, merlin-models could implement a custom train_step in the block/model using DE. That would be a much bigger change, though.
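To make the hybrid data/model-parallel case concrete, here is a minimal sketch of the kind of custom train_step being discussed. It uses only plain Horovod calls; the name-based check for identifying model-parallel embedding variables is a hypothetical stand-in for however distributed-embeddings actually tags its variables, not the library's real mechanism.

```python
import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()

@tf.function
def train_step(model, optimizer, loss_fn, features, labels):
    with tf.GradientTape() as tape:
        predictions = model(features, training=True)
        loss = loss_fn(labels, predictions)
    variables = model.trainable_variables
    grads = tape.gradient(loss, variables)
    synced_grads = []
    for var, grad in zip(variables, grads):
        if "distributed_embedding" in var.name:  # hypothetical check
            # Model-parallel embedding shards live on one GPU each,
            # so their gradients are applied locally, not averaged.
            synced_grads.append(grad)
        else:
            # Dense layers are replicated, so average their gradients
            # across workers as in standard Horovod data parallelism.
            synced_grads.append(hvd.allreduce(grad))
    optimizer.apply_gradients(zip(synced_grads, variables))
    return loss
```

This is exactly what wrapping the optimizer with hvd.DistributedOptimizer cannot express today: the wrapper allreduces every gradient, including the model-parallel ones that must stay local.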
The PR needs to be updated based on the dataloader changes. There is a new version of DE. We need to add an integration test as well, to be sure that the functionality is working.
@FDecaYed hello, do you have any updates for this PR? Thanks.
@rnyak Sorry, this fell off my list. However, I'm not familiar with merlin-models or the dataloader change you mentioned. @edknv do you know what it is, and could you help bring the code up to date?
Part of NVIDIA-Merlin/Merlin#733.
Goals ⚽
There is a package called distributed-embeddings, a library for building large embedding-based (e.g., recommender) models in TensorFlow. It's an alternative approach to SOK.
This PR introduces `DistributedEmbedding` for multi-GPU embedding table support.
Implementation Details 🚧
- `distributed-embeddings` by default will round-robin the entire embedding tables across the GPUs, e.g., the first embedding table on GPU 1, the second one on GPU 2, etc. A `column_slice` option also exists, but this has not been tested thoroughly from the Models side.
- The wrapper infers table configurations from `int_domain` (similarly to the existing `EmbeddingTable`), determines shapes, and translates a dictionary input into an ordered list input (because `distributed-embeddings` doesn't support dictionaries yet). A sketch of this translation follows the list.
- Users can replace `mm.Embeddings` with `mm.DistributedEmbeddings` in their models when they wish to use multi-GPU embedding tables. (See the unit test for DLRM, and the usage sketch after this list.)
- `distributed-embeddings` is for now installed via a script that clones the GitHub repo and installs from source, because there is no PyPI package.
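As an illustration of the items above, here is a hedged sketch of the intended usage and of the dictionary-to-ordered-list translation. The constructor arguments and the helper below are assumptions for the sake of example, not the PR's exact API.

```python
import merlin.models.tf as mm
from merlin.schema import Tags

# `schema` is assumed to be a merlin Schema describing the dataset.
cat_schema = schema.select_by_tag(Tags.CATEGORICAL)

# Drop-in replacement for mm.Embeddings; the `dim` argument mirrors
# the existing Embeddings API, but the exact signature is an assumption.
embeddings = mm.DistributedEmbeddings(cat_schema, dim=64)

# Conceptually, the wrapper must perform a translation like this,
# because distributed-embeddings expects an ordered list, not a dict:
def dict_to_ordered_list(inputs, feature_order):
    # feature_order is assumed to follow the schema's column order
    return [inputs[name] for name in feature_order]
```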
Testing Details 🔍
Unit tests: `tests/unit/tf/horovod/test_embedding.py`
Performance tests: TBD