Implement distributed training using horovod #3533
@@ -196,6 +196,27 @@ python3 DeepSpeech.py --train_files ./train.csv --dev_files ./dev.csv --test_fil
On a Volta generation V100 GPU, automatic mixed precision speeds up DeepSpeech training and evaluation by ~30%-40%.
Distributed training using Horovod
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you have a capable compute architecture, it is possible to distribute the training using `Horovod <https://github.com/horovod/horovod>`_. A fast network is recommended.
Horovod is capable of using MPI and NVIDIA's NCCL for highly optimized inter-process communication.
It also offers `Gloo <https://github.com/facebookincubator/gloo>`_ as an easy-to-set-up communication backend.

For more information about setup or tuning of Horovod, please visit `Horovod's documentation <https://horovod.readthedocs.io/en/stable/summary_include.html>`_.
Horovod is expected to run on heterogeneous systems (e.g. a different number and model of GPUs per machine).
However, this can cause unpredictable problems and would require user intervention in the training code.
Therefore, we only support homogeneous systems, meaning the same hardware and also the same software configuration (OS, drivers, MPI, NCCL, TensorFlow, ...) on each machine.
The only exception is a different number of GPUs per machine, since this can be controlled by ``horovodrun -H``.
Review discussion:

Reviewer: No risk of improper interactions with the batch size, for example?

Author: The batch size specified via the CLI is treated as the batch size for each worker, not for the machine or the complete system; therefore we do learning-rate rescaling. With code changes it would be possible to set different batch sizes on each GPU (e.g. for different memory sizes or load balancing), but this would open the door to load-balancing problems you do not want to support.

Reviewer: So the batch size applies equally to all GPUs of one machine? Sorry, but the few ...

Author: Horovod by itself does nothing with the batch size, so it applies per process:
https://github.com/tud-zih-tools/DeepSpeech/blob/329bf876069720cf05b4e4700e6d0dde104b6bac/training/deepspeech_training/train.py#L423 (Is it possible to link the code here directly?)
So, the effective batch size for training, on which the optimizer is applied, is the per-worker batch size multiplied by the number of workers. To prevent network convergence problems caused by this bigger effective batch size, we scale the learning rate as recommended by the Horovod devs. In theory Horovod has no problem if you apply different batch sizes to each GPU; in practice you want to make sure every process finishes its batch at about the same time (load balance). If one process is much too late, Horovod's error handling will take action.
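The batch-size arithmetic from this discussion can be sketched in plain Python. This is an illustrative sketch only: the helper names below are hypothetical, and in the PR the actual scaling is applied inside training/deepspeech_training/train.py.

```python
# Hypothetical helpers illustrating Horovod-style batch-size bookkeeping.
# They are not part of the DeepSpeech code base.

def effective_batch_size(per_worker_batch: int, num_workers: int) -> int:
    """Each Horovod worker processes its own batch, so the optimizer
    effectively sees the per-worker batch times the number of workers."""
    return per_worker_batch * num_workers

def scaled_learning_rate(base_lr: float, num_workers: int) -> float:
    """Linear learning-rate scaling, as recommended by the Horovod docs,
    to compensate for the larger effective batch size."""
    return base_lr * num_workers

# Example: 4 machines with 4 GPUs each, per-worker batch size of 32.
print(effective_batch_size(32, 16))   # 512
print(scaled_learning_rate(0.0001, 16))
```

Note the design choice discussed above: because the CLI batch size is per worker, the learning rate (not the batch size) is what gets adjusted as workers are added.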
Detailed documentation on how to run Horovod is provided `here <https://horovod.readthedocs.io/en/stable/running.html>`_.
The short command to train on 4 machines using 4 GPUs each:

.. code-block:: bash

   horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python3 DeepSpeech.py --train_files [...] --horovod
Review comment: Maybe it is worth linking an official, stable "how to use Horovod" crash course here?
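In the command above, the ``-np`` value must equal the total number of slots listed in ``-H``. A small sketch of that bookkeeping (the helper name is hypothetical, not part of Horovod or DeepSpeech):

```python
def total_processes(host_spec: str) -> int:
    """Hypothetical helper: sum the GPU slots from a horovodrun-style
    host specification such as 'server1:4,server2:2'. The -np value
    passed to horovodrun should match this total."""
    return sum(int(entry.split(':')[1]) for entry in host_spec.split(','))

print(total_processes('server1:4,server2:4,server3:4,server4:4'))  # 16
print(total_processes('server1:4,server2:2'))                      # 6
```

This also illustrates the "different number of GPUs per machine" exception mentioned earlier: uneven slot counts are fine as long as ``-np`` matches the sum.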
Checkpointing
^^^^^^^^^^^^^
@@ -76,6 +76,10 @@ def main():
        'tensorflow == 1.15.4'
    ]

    horovod_pypi_dep = [
        'horovod[tensorflow] == 0.21.3'
Review discussion:

Reviewer: (I have not checked the PyPI repo) How does that work when ...?

Author: Maybe you want to take a look here: https://horovod.readthedocs.io/en/stable/install.html
    ]

    # Due to pip craziness environment variables are the only consistent way to
    # get options into this script when doing `pip install`.
    tc_decoder_artifacts_root = os.environ.get('DECODER_ARTIFACTS_ROOT', '')
@@ -94,6 +98,12 @@ def main():
    else:
        install_requires = install_requires + tensorflow_pypi_dep

    if os.environ.get('DS_WITH_HOROVOD', ''):
        install_requires = install_requires + horovod_pypi_dep
    else:
        install_requires = install_requires

    setup(
        name='deepspeech_training',
        version=version,
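The ``DS_WITH_HOROVOD`` gating in the hunk above can be illustrated as a small, self-contained sketch (the function name is hypothetical; in the real setup.py this logic lives directly inside ``main()``):

```python
def build_install_requires(env):
    """Sketch of setup.py's dependency gating: the Horovod dependency is
    only appended when DS_WITH_HOROVOD is set, because environment
    variables are the only consistent way to pass options through pip."""
    tensorflow_pypi_dep = ['tensorflow == 1.15.4']
    horovod_pypi_dep = ['horovod[tensorflow] == 0.21.3']

    install_requires = list(tensorflow_pypi_dep)
    if env.get('DS_WITH_HOROVOD', ''):
        install_requires = install_requires + horovod_pypi_dep
    return install_requires

print(build_install_requires({}))                        # tensorflow only
print(build_install_requires({'DS_WITH_HOROVOD': '1'}))  # adds horovod
```

Usage would be ``DS_WITH_HOROVOD=1 pip install .``; without the variable, the Horovod extra is skipped entirely.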
Review discussion:

Reviewer: While I understand that people using Horovod should know the underlying requirements, I think "a fast network" might be troublesome for some users: we have had requests from people wanting to use distributed training to leverage several GPUs over only Gigabit Ethernet networks.

Author: We are still doing experiments, but Ethernet might be sufficient for small setups, e.g. using two or three systems.