Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement distributed training using horovod #3533

Merged
merged 4 commits into from
Mar 15, 2021

Conversation

NanoNabla
Copy link
Contributor

As already mentioned in Discourse we implemented distributed training using Horovod.

We tried to keep the changes as minimal as possible. It is still possible to run your undistributed code version. However we also noticed a slightly improvement by using Horovod on one machine.

We copied your train-function to train_with_horovod but I can also provide an merge by using some if FLAGS.horovod. This would reduce code duplication. It is just to improve readability at the moment.

While our code has a good scaling on local stored datasets, we experienced a lack of performance on datasets stored on you globally distributed filesystem (BeeGFS). This could be caused by tf.data.Dataset.shard. We will investigate on this.

doc/TRAINING.rst Outdated Show resolved Hide resolved
doc/TRAINING.rst Outdated Show resolved Hide resolved
setup.py Outdated Show resolved Hide resolved
setup.py Outdated Show resolved Hide resolved
@lissyx
Copy link
Collaborator

lissyx commented Feb 16, 2021

We copied your train-function to train_with_horovod but I can also provide an merge by using some if FLAGS.horovod. This would reduce code duplication. It is just to improve readability at the moment.

I do think this makes it very difficult to review.

One question: how can this interact with TensorFlow 2 ? @reuben is working to upgrade the training path to TF2, so if Horovod breaks or requires significant changes it might be problematic?

@NanoNabla
Copy link
Contributor Author

Horovod works fine with TensorFlow 2. I already took a look at #3485. There should be not problem to get it compatible with reuben's code

@lissyx
Copy link
Collaborator

lissyx commented Feb 16, 2021

Horovod works fine with TensorFlow 2. I already took a look at #3485. There should be not problem to get it compatible with reuben's code

"no problem" as in "it's a piece of cake and can be folded with the tensorflow upgrade" or as in "it will break badly until it is fixed"

I know @reuben was having perfs issues on his PR, it'd be super nice if you have feedback / time to check if horovod helps there as well.

@lissyx
Copy link
Collaborator

lissyx commented Feb 16, 2021

@NanoNabla Also, you'll want @reuben to have a look, but he's generally busy and currently not available for at least one full week, so please be patient :-).

Distributed training using Horovod
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you have a capable compute architecture, it is possible to distribute the training using `Horovod <https://github.com/horovod/horovod>`_. A fast network is recommended.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I understand that people using Horovod should know the underlying requirements, I think a fast network might be troublesome to some users: we have had requests of people to use distributed training to leverage several GPUs on only Gigabit Ethernet networks.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are still doing experiments, but ethernet might be sufficient for small setups, e.g. using two or three systems.


.. code-block:: bash

horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python3 DeepSpeech.py --train_files [...] --horovod
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe worth linking some official stable crash-course "how to use horovod" here ?

@@ -76,6 +76,10 @@ def main():
'tensorflow == 1.15.4'
]

horovod_pypi_dep = [
'horovod[tensorflow] == 0.21.3'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I have not checked the pypi repo) how does that works when tensorflow-gpu is needed, is this horovod[tensorflow] still working as expected ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you want to take a look here https://horovod.readthedocs.io/en/stable/install.html
There is only horovod[tensorflow] for TensorFlow bindings and should be the same as HOROVOD_WITH_TENSORFLOW=1 pip install horovod.

doc/TRAINING.rst Outdated

For more information about setup or tuning of Horovod please visit `Horovod's Github <https://github.com/horovod/horovod>`_.

To train on 4 machines using 4 GPUs each:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, are there any requirements:

  • same GPUs ?
  • same OS ?
  • same drivers ?
  • same number of GPUs on each system ?

Copy link
Contributor Author

@NanoNabla NanoNabla Feb 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, Horovod is expected to run on highly heterogeneous systems, it is not documented well what works.
I would not support something else than different GPUs per machine.This can be controlled by the number of processes per host, since every process has only one GPU pinned to it.
horovodrun -np SUM_OF_PROCNUM -H server1:NUMPROC_1,server2:NUMPROC_2,server3 [...]

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you can add link to Horovod doc covering those aspects ? And articulate that we cannot support anything besides homogenous configurations and that heterogenous cases will have to deal with their mess ? :)

Copy link
Collaborator

@lissyx lissyx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, it's much easier to review. I only had a quick look though, I still need to go over in details of the training loop as well as of the dataset handling.

@lissyx
Copy link
Collaborator

lissyx commented Feb 18, 2021

@NanoNabla @AndreasGocht Dumb question, but are there limitations that might make Horovod not working within a Docker context?

@AndreasGocht
Copy link

Hey,

I have never tried, but there is no principle limitation I am aware of. However, using MPI, there might be some gotchas:

  • If you use multiple Container the Containers need to see each other. There is a StackOverflow entry which discusses this.
  • Some MPI implementations try to use Kernel modules doing zero copies between processes. I am not sure how this works with Docker. However, as I remember they do have a fallback (though they might not be as fast)

I am not sure if Gloo is an alternative here.

BTW: do you know Singularity? It's an HPC container environment, so it focuses on performance. It might be worth a look.

Best,

Andreas

@NanoNabla
Copy link
Contributor Author

NanoNabla commented Feb 18, 2021

As read in the docs Horovod should run within a Docker.
As docker is not supported by our hpc system and I don't have to capability to test it locally I can not give first hand support for this.

Other people on our HPC system use Horovod within Singularity which is a HPC container environment simular to docker and mostly compatible as far as I know.

@lissyx
Copy link
Collaborator

lissyx commented Feb 18, 2021

I guess I should have been more clear in my question, it's not in the context of leveraging distributed training, but on a single machine: I was thinking of giving a test run between non horovod training and horovod enabled here on my 2x RTX 2080 Ti. So, I would not care about MPI and others.

@NanoNabla
Copy link
Contributor Author

If you run both processes in the same container, I do not see any problems since there communication and so on.
Just install DeepSpeech as usual, but set set DS_WITH_HOROVOD. It will take some time longer, since Horovod will be build from PyPi (no prebuild packages).
Without having MPI installed it should use Gloo and NCCL (already configured in your docker example?) which is fine in your setup.

Finally run
horovodrun -np 2 python DeepSpeech --hororov [...]

To train on 4 machines using 4 GPUs each:
Horovod is expected to run on heterogeneous systems (e.g. different number and model type of GPUs per machine).
However, this can cause unpredictable problems and user interaction in training code is needed.
Therefore, we do only support homogenous systems, which means same hardware and also same software configuration (OS, drivers, MPI, NCCL, TensorFlow, ...) on each machine.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

Horovod is expected to run on heterogeneous systems (e.g. different number and model type of GPUs per machine).
However, this can cause unpredictable problems and user interaction in training code is needed.
Therefore, we do only support homogenous systems, which means same hardware and also same software configuration (OS, drivers, MPI, NCCL, TensorFlow, ...) on each machine.
The only exception is different number of GPUs per machine, since this can be controlled by ``horovodrun -H``.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No risk of improper interactions with batch size for example?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not get your question at all.

Specified batch size via CLI will treated as batch size for each worker not for the machine or complete system, Therefore we do learning rate rescaling.
If all gpus are equal I do not see any problems, since all get the same batch size.

If you change code it would be possible to set different batch sizes on each gpu (e.g. for different memory or load balancing). This would open doors for load balance problem, you do not what to support.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If all gpus are equal I do not see any problems, since all get the same batch size.

batch size applies equally to all GPUs of one machine?

Sorry but the few hovorovrun examples makes it hard to grasp the meaning of the commands.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Horovod by itself does nothing with the batch size and therefore also horovodrun
The batch size used on each horovod process and so every GPU on every machine is the batch size you specified like normally with DeepSpeech.py --train_batch_size and is only used on creating dataset

train_set = create_dataset(FLAGS.train_files.split(','),
                           batch_size=FLAGS.train_batch_size,
                           epochs=FLAGS.epochs,
                           augmentations=Config.augmentations,
                           cache_path=FLAGS.feature_cache,
                           train_phase=True,
                           exception_box=exception_box,
                           process_ahead=Config.num_devices * FLAGS.train_batch_size * 2,
                           reverse=FLAGS.reverse_train,
                           limit=FLAGS.limit_train,
                           buffering=FLAGS.read_buffer,
                           split_dataset=split_dataset)

https://github.com/tud-zih-tools/DeepSpeech/blob/329bf876069720cf05b4e4700e6d0dde104b6bac/training/deepspeech_training/train.py#L423 (Is it possible to link the code here directly?)

So, your effective batch size for training on which the Optimizer is applyed is FLAGS.train_batch_size * NUM_PROC where NUM_PROC is the number of used horovod processes which is equally to the number gpus on your whole setup. In Horovod terminus, equally to HPC MPI terminus (horovod.size())

To prevent network convergence problems because of this bigger effective batch size we scale the learning rate as recommented by the horovod devs
with
FLAGS.learning_rate * NUM_PROC
https://github.com/tud-zih-tools/DeepSpeech/blob/329bf876069720cf05b4e4700e6d0dde104b6bac/training/deepspeech_training/train.py#L487

In theory horovod has no problem if you apply different batch sizes to each gpus. In practice you want to make sure every process finishes with its batch at about the same time (load balance). If one process is much late horovod error handling will take action.

@NanoNabla
Copy link
Contributor Author

We know you only have sparse time but what is the current state of checking this PR? Can we support it e.g. providing further information and explanations? Are their preconditions for you like TF2 integration?

We already checked the problem in #3485 because @lissyx mentioned it here but we don't have any clue how to fix it. Horovod does not change performance on a single worker (single GPU). You could just blow your batch size by using multiple horovod workers (and so GPUs) to it's previous value but that's not what you want by efficiency.

@lissyx
Copy link
Collaborator

lissyx commented Mar 9, 2021

We know you only have sparse time but what is the current state of checking this PR? Can we support it e.g. providing further information and explanations? Are their preconditions for you like TF2 integration?

Well, I asked about tf2 specifically because I know it will take me some time to properly test this, and I wanted to avoid you waste your time on something that would be requiring complete rewrite. Unfortunately, I have not yet been able to dedicate time, but if planets gets aligned properly tonight, this might change quickly.

@NanoNabla
Copy link
Contributor Author

Just say the coordinates and I'll kick the planets for you. :)

@reuben
Copy link
Contributor

reuben commented Mar 9, 2021

I think if this is passing training tests and doesn't break compatibility with loading existing checkpoints then it's OK to land as-is. The TF2 path is looking like an almost complete training code rewrite anyway, so I don't think it makes sense to block this PR on that.

Copy link
Contributor

@reuben reuben left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM. The only question I have is about checkpoint compatibility. It doesn't look like it's broken by this, but I haven't tested. If it is broken, then we should document at least in the --horovod flag doc, similar to --train_cudnn.

@NanoNabla
Copy link
Contributor Author

What was the result of your CI tests?

I checked checkpointing compatibility. I ran 2 nodes each 2 GPUs with horovod and used this checkpoints for 1 gpu on one node without horovod successfully.

@lissyx
Copy link
Collaborator

lissyx commented Mar 15, 2021

I think if this is passing training tests and doesn't break compatibility with loading existing checkpoints then it's OK to land as-is. The TF2 path is looking like an almost complete training code rewrite anyway, so I don't think it makes sense to block this PR on that.

Good for me.

@lissyx
Copy link
Collaborator

lissyx commented Mar 15, 2021

I checked checkpointing compatibility. I ran 2 nodes each 2 GPUs with horovod and used this checkpoints for 1 gpu on one node without horovod successfully.

Does that means "yes" to reuben's question ?

@lissyx
Copy link
Collaborator

lissyx commented Mar 15, 2021

What was the result of your CI tests?

None yet, but I just triggered TC run (maybe one of the last, see #3317)

@lissyx
Copy link
Collaborator

lissyx commented Mar 15, 2021

It ain't breaking badly, let's merge.

@lissyx lissyx merged commit bde9a1d into mozilla:master Mar 15, 2021
@NanoNabla
Copy link
Contributor Author

Thanks @lissyx and @reuben !

I checked checkpointing compatibility. I ran 2 nodes each 2 GPUs with horovod and used this checkpoints for 1 gpu on one node without horovod successfully.

Does that means "yes" to reuben's question ?

It was a yes. It works. No problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants