Introduce distributed embeddings #974
Conversation
The distributed embedding examples use custom train step functions. In my understanding, distributed embedding does NOT work with the Keras model.fit function. I think we need the distributed-embeddings team to review the PR.
Looks good to me.
On model.fit support:
The current code using model.fit should work, since nothing in distributed-embeddings conflicts with Keras model.fit/compile.
The reason we have a custom train_step() example is for when the user wants hybrid data/model parallelism. The way Horovod data parallelism supports the model.fit() API is by wrapping the optimizer, which will break if distributed model parallelism is also used. To my understanding, merlin-models also integrates with Horovod data parallelism through DistributedOptimizer? If so, users could run into problems when they use both integrations.
We will support model.fit + hvd.DistributedOptimizer in the next release, so the code here should just work. One caveat is that the fix (and later versions of DE) will depend on a later Horovod version.
Alternatively, merlin-models could implement a custom train_step in the block/model using DE. That would be a much bigger change, though.
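To make the hybrid data/model-parallel case concrete, here is a minimal sketch of the kind of custom train_step being discussed. It uses only plain Horovod calls; the name-based check for identifying model-parallel embedding variables is a hypothetical stand-in for however distributed-embeddings actually tags its variables, not the library's real mechanism.

```python
import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()

@tf.function
def train_step(model, optimizer, loss_fn, features, labels):
    with tf.GradientTape() as tape:
        predictions = model(features, training=True)
        loss = loss_fn(labels, predictions)
    variables = model.trainable_variables
    grads = tape.gradient(loss, variables)
    synced_grads = []
    for var, grad in zip(variables, grads):
        if "distributed_embedding" in var.name:  # hypothetical check
            # Model-parallel embedding shards live on one GPU each,
            # so their gradients are applied locally, not averaged.
            synced_grads.append(grad)
        else:
            # Dense layers are replicated, so average their gradients
            # across workers as in standard Horovod data parallelism.
            synced_grads.append(hvd.allreduce(grad))
    optimizer.apply_gradients(zip(synced_grads, variables))
    return loss
```

This is exactly what wrapping the optimizer with hvd.DistributedOptimizer cannot express today: the wrapper allreduces every gradient, including the model-parallel ones that must stay local.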
The PR needs to be updated based on the dataloader changes. There is a new version of DE. We need to add an integration test as well, to be sure that the functionality is working.
@FDecaYed hello, do you have any updates for this PR? Thanks.
@rnyak Sorry, this fell off my list. However, I'm not familiar with merlin-models or the dataloader change you mentioned. @edknv do you know what it is, and could you help bring the code up to date?
Part of NVIDIA-Merlin/Merlin#733.
Goals ⚽
There is a package called distributed-embeddings, a library for building large embedding-based (e.g., recommender) models in TensorFlow. It's an alternative approach to SOK.
This PR introduces `DistributedEmbedding` for multi-GPU embedding table support.
Implementation Details 🚧
- `distributed-embeddings` by default will round-robin the entire embedding tables across the GPUs, e.g., the first embedding table on GPU 1, the second one on GPU 2, etc. A `column_slice` option also exists, but this has not been tested thoroughly from the Models side.
- The wrapper infers table configurations from `int_domain` (similarly to the existing `EmbeddingTable`), determines shapes, and translates a dictionary input into an ordered list input (because `distributed-embeddings` doesn't support dictionaries yet). A sketch of this translation follows the list.
- Users can replace `mm.Embeddings` with `mm.DistributedEmbeddings` in their models when they wish to use multi-GPU embedding tables. (See the unit test for DLRM, and the usage sketch after this list.)
- `distributed-embeddings` is for now installed via a script that clones the GitHub repo and installs from source, because there is no PyPI package.
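As an illustration of the items above, here is a hedged sketch of the intended usage and of the dictionary-to-ordered-list translation. The constructor arguments and the helper below are assumptions for the sake of example, not the PR's exact API.

```python
import merlin.models.tf as mm
from merlin.schema import Tags

# `schema` is assumed to be a merlin Schema describing the dataset.
cat_schema = schema.select_by_tag(Tags.CATEGORICAL)

# Drop-in replacement for mm.Embeddings; the `dim` argument mirrors
# the existing Embeddings API, but the exact signature is an assumption.
embeddings = mm.DistributedEmbeddings(cat_schema, dim=64)

# Conceptually, the wrapper must perform a translation like this,
# because distributed-embeddings expects an ordered list, not a dict:
def dict_to_ordered_list(inputs, feature_order):
    # feature_order is assumed to follow the schema's column order
    return [inputs[name] for name in feature_order]
```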
Testing Details 🔍
Unit tests: `tests/unit/tf/horovod/test_embedding.py`
Performance tests: TBD