[BUG] Data parallel training freezes due to different number of batches #75
Comments
@bschifferer will try to set the seed before the dataloader and check it out.
@jperez999 I provided an example with Merlin Models: I added the seed_fn.
When I check the print statement, I get the following split:
@jperez999 @benfred @rjzamora Trying to figure out if this has always been broken or if it's a recent change.
So this is not the correct way to use the Merlin dataloader with Horovod; this requires a lot more background information. You should never be creating dataloaders in a for loop. When dealing with Horovod, you should follow the example in the NVTabular tests: https://github.com/NVIDIA-Merlin/NVTabular/blob/main/tests/unit/loader/test_tf_dataloader.py#L537. Notice that to use Horovod you need to launch training through the horovodrun subprocess, and you also need to use the supplied wrapper script, which adds the variables MPI needs under the hood: https://github.com/NVIDIA-Merlin/NVTabular/blob/main/examples/multi-gpu-movielens/hvd_wrapper.sh.
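For reference, a minimal sketch of the pattern used in that test and in the multi-gpu-movielens example; the dataset path, batch size, and the simplified seed_fn are placeholders, and the global_size/global_rank/seed_fn arguments are assumed to match the Merlin dataloader API:

```python
# Launch one process per GPU via the wrapper, e.g.:
#   horovodrun -np 2 sh hvd_wrapper.sh python tf_trainer.py
import numpy as np
import tensorflow as tf
import horovod.tensorflow as hvd

from merlin.io import Dataset
from merlin.dataloader.tensorflow import Loader

hvd.init()

# Pin this worker to its own GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

def seed_fn():
    # Every worker must shuffle partitions with the same seed; summing a
    # per-worker random fragment with allreduce is one way to agree on one.
    fragment = np.random.randint(0, 2**31, dtype=np.int64)
    reduced = hvd.allreduce(tf.constant(fragment), op=hvd.Sum)
    return int(reduced.numpy()) % (2**31)

train = Dataset("train/*.parquet", engine="parquet")  # placeholder path
loader = Loader(
    train,
    batch_size=65536,           # placeholder batch size
    shuffle=True,
    seed_fn=seed_fn,
    global_size=hvd.size(),     # total number of workers
    global_rank=hvd.rank(),     # this worker's shard of the partitions
)
```

The key point is that each worker builds a single loader that is sharded via global_size/global_rank, rather than constructing dataloaders in a loop.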
@bschifferer can you follow up and make sure we're using Horovod correctly? We'll probably need to find a way to make it clear to our customers how to set this up properly, even if it's just giving the links that @jperez999 shared more prominence.
@jperez999 Is there a way to produce an equal number of batches so that the workload is balanced across workers? Although NVTabular seems to produce equal-sized batches in tf_trainer.py, the number of batches differs per worker (hence the need for hvd.join()):

```python
for batch, (examples, labels) in enumerate(train_dataset_tf):
    loss_value = training_step(examples, labels, batch == 0)
print(f"There are {batch} batches in worker {hvd.local_rank()}.")
# hvd.join()
```

Without hvd.join(), one workaround I have is to repartition the dataset with something like

```python
train = Dataset(output_path / "train" / "*.parquet")
ddf = train.to_ddf().repartition(npartitions=hvd.size())
train = Dataset(ddf, schema=train.schema)
```

but I'm wondering if there is a better way to do this in the dataloader.

Edit: I ran
So I just ran this unit test: pytest tests/unit/loader/test_tf_dataloader.py::test_horovod_multigpu and it runs as expected. There are five partitions spread across two workers, so naturally one worker will get more partitions than the other. The dataloader is designed like this. Now, what can happen is that, depending on the batch size, you end up slicing your partitions into much smaller pieces. This could mean that one partition gives you 100+ batches, and if one worker has one more partition than the others, then that worker ends up with 100 extra batches. Remember that the split does not happen based on batches; it happens based on partitions. Those partitions are subsequently broken down into chunks of batch_size. You can try to repartition the dataset so that it puts out the same number of partitions in each worker, but even then you would need to guarantee that the partitions are all the same size. This is a gotcha we have known about since the creation of the dataloader, because partitions are not merged across files. Let's say you have a dataset with two files: file one has 225 rows and your partition size is 50 rows, so that first file has 5 partitions (even though the last partition is half full). File two has only 150 rows, so it has 3 partitions, and you end up in a situation where one worker gets four full partitions and the other gets three full and one partial. Now extrapolate that to a scenario where all files end with half partitions... you can see how you end up with non-full partitions littered across your dataset.
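To make that arithmetic concrete, here is a small stand-alone sketch (plain Python, not Merlin code; the round-robin assignment of partitions to workers is only for illustration):

```python
import math

part_rows = 50          # rows per partition
batch_size = 10         # rows per batch
file_rows = [225, 150]  # the two files from the example above

# Partitions are created per file and never merged across files.
partitions = []
for rows in file_rows:
    full, remainder = divmod(rows, part_rows)
    partitions += [part_rows] * full + ([remainder] if remainder else [])

print(partitions)  # [50, 50, 50, 50, 25, 50, 50, 50]

# Each worker gets a subset of partitions and slices them into batches,
# so unequal partition sizes translate into unequal batch counts.
num_workers = 2
for rank in range(num_workers):
    my_parts = partitions[rank::num_workers]
    batches = sum(math.ceil(p / batch_size) for p in my_parts)
    print(f"worker {rank}: {len(my_parts)} partitions, {batches} batches")
```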
@jperez999 @EvenOldridge thanks. |
Even if it is a single file, I can get a different number of batches. The
Output:
Changing part_size to 200MB:
Not providing any part_size:
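The scripts and outputs for these runs are not captured above. As a rough sketch of the kind of per-rank check being described (paths and batch size are placeholders; part_size is the merlin.io.Dataset argument that controls partitioning):

```python
import horovod.tensorflow as hvd
from merlin.io import Dataset
from merlin.dataloader.tensorflow import Loader

hvd.init()

# Single parquet file; part_size determines how it is split into partitions.
train = Dataset("train/part_0.parquet", engine="parquet", part_size="200MB")

loader = Loader(
    train,
    batch_size=65536,  # placeholder
    shuffle=False,
    global_size=hvd.size(),
    global_rank=hvd.rank(),
)

num_batches = sum(1 for _ in loader)
print(
    f"rank {hvd.rank()}: {train.to_ddf().npartitions} total partitions, "
    f"{num_batches} batches on this rank"
)
```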
Let's repartition the dataset based on the suggestion here:
Output:
Output with part_size=200MB:
Output with no part_size:
Having multiple files:
Output:
Having multiple files with repartition:
Output:
If we use NVTabular to process multiple input files, it will generate multiple output files with the same shapes:
Output:
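The exact preprocessing code is not shown in this thread; as a hedged sketch, an NVTabular workflow over multiple input files (with placeholder columns, ops, and paths) might look like:

```python
import nvtabular as nvt
from merlin.io import Dataset

# Placeholder feature ops; the real workflow from this issue is not shown.
cats = ["user_id", "item_id"] >> nvt.ops.Categorify()
workflow = nvt.Workflow(cats)

train = Dataset("raw/train/*.parquet", engine="parquet")  # multiple input files
workflow.fit(train)

# out_files_per_proc controls how many output parquet files each process writes.
workflow.transform(train).to_parquet(
    output_path="processed/train",
    out_files_per_proc=8,
)
```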
Memory Footprint:
Memory Footprint II: Dataset Creation
Repartition:
Without Repartition:
So I have done a preliminary investigation into this issue based on the reproducer, and I have found that the following code, using plain TensorFlow and Horovod, does work as expected. In the following example I used only two processes, because I have only two GPUs available on my dev machine. Using the code provided previously in this thread, I was still able to reproduce the error (a hang during training). However, the following code runs successfully:
In the code you get the following output:
Notice that one of the processes has ~50 fewer batches than the other, and we are still able to complete training. The process with fewer batches finishes, and the other continues iterating until it has completed training on its batches. This leads me to believe the problem lies in the Merlin Models code. I have started investigating and have found a few differences in how the code is being run: first, in Merlin Models we are using callbacks; second, I was unable to find anywhere in the code where the hvd.join() command is used; third, the Merlin Models code relies on the super().fit call to the Keras model class. I tried inserting the hvd.join() command just after the super().fit call is made, but that did not fix the issue (the code still hangs). I will continue investigating; this is, however, IMO, not a dataloader issue.
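The plain TensorFlow + Horovod code and its output are not reproduced in this thread. As a hedged sketch only (toy model, placeholder path and hyperparameters, and assuming all features are scalar numeric columns), a training loop in that style over the Merlin dataloader could look like this:

```python
import tensorflow as tf
import horovod.tensorflow as hvd

from merlin.io import Dataset
from merlin.dataloader.tensorflow import Loader

hvd.init()

train = Dataset("train/*.parquet", engine="parquet")  # placeholder path
loader = Loader(train, batch_size=1024, shuffle=True,
                global_size=hvd.size(), global_rank=hvd.rank())

model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid")])
optimizer = tf.keras.optimizers.SGD(0.01 * hvd.size())
loss_fn = tf.keras.losses.BinaryCrossentropy()

@tf.function
def training_step(features, labels, first_batch):
    # The loader yields a dict of feature tensors; this toy model simply
    # concatenates them (assumes scalar numeric columns only).
    x = tf.concat(
        [tf.cast(tf.reshape(v, (-1, 1)), tf.float32) for v in features.values()],
        axis=1,
    )
    with tf.GradientTape() as tape:
        probs = model(x, training=True)
        loss = loss_fn(labels, probs)
    # Average gradients across workers before applying them.
    tape = hvd.DistributedGradientTape(tape)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    if first_batch:
        # Start every worker from the same weights (Horovod's examples also
        # broadcast the optimizer state).
        hvd.broadcast_variables(model.variables, root_rank=0)
    return loss

for batch, (examples, labels) in enumerate(loader):
    loss_value = training_step(examples, labels, batch == 0)
print(f"There are {batch} batches in worker {hvd.local_rank()}.")

# hvd.join() lets the workers that run out of batches first wait for the
# others instead of dropping out of the collective communication.
hvd.join()
```

The hvd.join() call at the end is the piece whose absence in the Merlin Models fit path is being discussed above.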
Bug description
In data parallel training, we start multiple workers, each with its own initialization of the dataloader, and train with Horovod. After each batch, the parameter updates are synced. The Merlin dataloader produces a different number of batches depending on the rank. Therefore, some workers finish the training loop while other workers are still training, which causes Horovod to freeze.
Output: