
Introduce distributed embeddings #974

Open · wants to merge 27 commits into base: main

Conversation

@edknv edknv commented Feb 7, 2023

Part of NVIDIA-Merlin/Merlin#733.

Goals ⚽

distributed-embeddings is a library for building large embedding-based models (e.g., recommenders) in TensorFlow. It is an alternative approach to SOK (SparseOperationKit).

This PR introduces DistributedEmbedding for multi-GPU embedding table support.

Implementation Details 🚧

  • By default, distributed-embeddings round-robins entire embedding tables across the GPUs, e.g., the first embedding table on GPU 1, the second on GPU 2, and so on.
  • In theory, tables can be sharded using column_slice, but this has not been tested thoroughly from the Models side.
  • Most of the logic is for inferring the embedding size from the schema using the cardinality in int_domain (similar to the existing EmbeddingTable), determining shapes, and translating dictionary inputs into ordered list inputs (because distributed-embeddings does not support dictionaries yet).
  • From the user's perspective, mm.Embeddings can be replaced with mm.DistributedEmbeddings when multi-GPU embedding tables are desired. (See the unit test for DLRM, and the sketch after this list.)
  • Depends on upstream fix: Use tf.shape for graph mode support distributed-embeddings#6
  • Added a GitHub Actions workflow for running unit tests that depend on Horovod. For now, distributed-embeddings is installed via a script that clones the GitHub repo and installs from source, because there is no PyPI package.
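
A minimal sketch of the intended drop-in usage for a DLRM-style model. The exact `DistributedEmbeddings` constructor arguments and the `embeddings=` hook on `DLRMModel` are illustrative assumptions here, not a confirmed final API:

```python
import merlin.models.tf as mm
from merlin.schema import Tags

# Sketch only: assumes `schema` is a merlin.schema.Schema for the dataset
# and that DLRMModel accepts an `embeddings` block (illustrative).
def build_multi_gpu_dlrm(schema, train_data):
    embeddings = mm.DistributedEmbeddings(
        schema.select_by_tag(Tags.CATEGORICAL)  # tables placed round-robin across GPUs
    )
    model = mm.DLRMModel(
        schema,
        embedding_dim=64,
        embeddings=embeddings,  # swap-in point: mm.Embeddings -> mm.DistributedEmbeddings
        bottom_block=mm.MLPBlock([128, 64]),
        top_block=mm.MLPBlock([128, 64]),
        prediction_tasks=mm.BinaryClassificationTask("click"),
    )
    model.compile(optimizer="adam")
    model.fit(train_data, batch_size=1024)
    return model
```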

Testing Details 🔍

Unit tests: tests/unit/tf/horovod/test_embedding.py
Performance tests: TBD

@github-actions bot commented Feb 7, 2023

Documentation preview

https://nvidia-merlin.github.io/models/review/pr-974

@edknv edknv self-assigned this Feb 7, 2023
@edknv edknv added the enhancement New feature or request label Feb 7, 2023
@edknv edknv added this to the Merlin 23.03 milestone Feb 8, 2023
@edknv edknv force-pushed the distributed-embeddings branch from bf1c6e7 to 53d0311 on March 7, 2023 18:16
@edknv edknv force-pushed the distributed-embeddings branch from 53d0311 to 5c7fd5a on March 7, 2023 18:31
@edknv edknv marked this pull request as ready for review March 8, 2023 04:37
@edknv edknv changed the title from "[WIP] Introduce distributed embeddings" to "Introduce distributed embeddings" Mar 8, 2023
@rnyak rnyak requested a review from bschifferer March 8, 2023 17:08
@edknv edknv requested a review from marcromeyn March 10, 2023 03:14
@bschifferer

The distributed-embeddings example uses a custom train step function:
https://github.com/NVIDIA-Merlin/distributed-embeddings/blob/main/examples/dlrm/main.py#L201-L215

In my understanding, distributed-embeddings does NOT work with the Keras model.fit function:
https://github.com/NVIDIA-Merlin/models/pull/974/files#diff-1e42e5c4771f01c26b3c78c545eb341590a4406b2c5af8da0491ab4b7ea51464R80

I think we need the distributed-embeddings team to review this PR.

@FDecaYed left a comment

Looks good to me.

On model.fit support: the current code using model.fit should work, since nothing in distributed-embeddings conflicts with Keras model fit/compile.

The reason we have a custom train_step() example is for when the user wants hybrid data/model parallelism. The way Horovod data parallelism supports the model.fit() API is by wrapping the optimizer, which will now break if distributed model parallelism is also used. To my understanding, Merlin Models also integrates with Horovod data parallelism through DistributedOptimizer? If so, users could run into problems when they use both integrations.

We will support model.fit + hvd.DistributedOptimizer in the next release, so the code here should just work. One caveat is that the fix (and later versions of DE) will depend on a later Horovod version.

Alternatively, Merlin Models would need to implement a custom train_step in the block/model that uses DE. That would be a much bigger change, though.
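
For context, a minimal sketch of the custom train_step pattern described above, assuming the model exposes its sharded tables as `self.embeddings` (illustrative names, not the distributed-embeddings API). Only the replicated, data-parallel gradients are allreduced; the sharded embedding gradients stay local, which is exactly the distinction a blanket hvd.DistributedOptimizer wrapper cannot make:

```python
import horovod.tensorflow as hvd
import tensorflow as tf

class HybridParallelModel(tf.keras.Model):
    """Sketch of a custom train_step for hybrid data/model parallelism.

    Assumes `self.embeddings` holds the model-parallel (sharded) variables
    and everything else is replicated data-parallel.
    """

    def train_step(self, data):
        features, labels = data
        with tf.GradientTape() as tape:
            predictions = self(features, training=True)
            loss = self.compiled_loss(labels, predictions)
        variables = self.trainable_variables
        gradients = tape.gradient(loss, variables)
        # Allreduce only the replicated (data-parallel) gradients; the
        # sharded embedding gradients are applied locally on each worker.
        model_parallel = set(id(v) for v in self.embeddings.trainable_variables)
        gradients = [
            g if id(v) in model_parallel else hvd.allreduce(g)
            for g, v in zip(gradients, variables)
        ]
        self.optimizer.apply_gradients(zip(gradients, variables))
        self.compiled_metrics.update_state(labels, predictions)
        return {m.name: m.result() for m in self.metrics}
```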

@edknv edknv modified the milestones: Merlin 23.03, Merlin 23.04 Mar 21, 2023
@rnyak commented Apr 5, 2023

The PR needs to be updated based on the dataloader changes. There is a new version of DE. We also need to add an integration test to be sure that the functionality is working.

@rnyak rnyak modified the milestones: Merlin 23.04, Merlin 23.05 Apr 17, 2023
@rnyak rnyak modified the milestones: Merlin 23.05, Merlin 23.06 May 15, 2023
@rnyak commented Jun 5, 2023

@FDecaYed hello, do you have any updates for this PR? Thanks.

@FDecaYed

@rnyak Sorry, this fell off my list.
As of now, DE has already added support for model.fit, so some of the problems should be gone. I'm willing to jump in and help if needed.

On the other hand, I'm not familiar with Merlin Models or the dataloader change you mentioned. @edknv, do you know what it is, and could you help bring the code up to date?
