Replies: 1 comment
Can I train YOLOv11 with multi-GPU on SLURM?
Hello,
Thank you for taking the time to read my post. I have been trying to run training on a supercomputer using torch.distributed.run in a multi-node, multi-GPU setup, on over 130,000 images at 1536x2048 resolution. The nodes appear to have trouble communicating with each other, so training never actually starts. In this example I have 2 nodes with 1 GPU per node, and I use the following bash script with SLURM directives to request the resources for the job:
#!/bin/bash
#SBATCH --job-name=yolov5_training
#SBATCH --partition=xeon-g6-volta
#SBATCH --output=./jobs/train%A.out
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:volta:1
#SBATCH --exclusive
#Load necessary modules
source /etc/profile
module load anaconda/2023a-pytorch cuda/11.8
srun --nodes=$SLURM_NNODES --ntasks-per-node=$SLURM_NTASKS_PER_NODE bash -c '
#Get the total number of nodes allocated
N=$(scontrol show hostnames | wc -l)
#Get the hostname of the current node
current_node=$(hostname)
#Assign node_rank based on the current node
if [ "$current_node" = "$(scontrol show hostnames | head -n1)" ]; then
node_rank=0 # Set node_rank to 0 for the master node
else
# Determine the node rank for non-master nodes
node_rank=$(($(scontrol show hostnames | grep -n "$current_node" | cut -d":" -f1) - 1))
fi
#Print the node_rank for each node
echo "Node $current_node has rank $node_rank"
#Set the master address and port only for the first task (index 0)
if [ $node_rank -eq 0 ]; then
MASTER_ADDR=$(hostname -I)
rm -f shared_ip.sh
echo "export MASTER_ADDR=$MASTER_ADDR" > shared_ip.sh
else
# For other tasks, wait for a short duration to ensure the master has set the variable
sleep 5
fi
#Wait for the master to set the variable
while [ ! -f shared_ip.sh ]; do
sleep 5
done
#Source the shared file to set the MASTER_ADDR variable
source shared_ip.sh
echo "MASTER_ADDR="$MASTER_ADDR
MY_ADDR=$(hostname -I)
echo "MY_ADDRESS="$MY_ADDR
MASTER_PORT=43829
echo python -m torch.distributed.run \
    --nproc_per_node $SLURM_NTASKS_PER_NODE \
    --nnodes $SLURM_NNODES \
    --node_rank $node_rank \
    --master_addr "$MASTER_ADDR" \
    --master_port $MASTER_PORT \
    train.py --data training.yaml --weights yolov5s.pt --img 2048 --project runs/train/11-15
echo "Begin Training: Node $node_rank"
python -m torch.distributed.run \
    --nproc_per_node $SLURM_NTASKS_PER_NODE \
    --nnodes $SLURM_NNODES \
    --node_rank $node_rank \
    --master_addr "$MASTER_ADDR" \
    --master_port $MASTER_PORT \
    train.py --data training.yaml --weights yolov5s.pt --img 2048 --project runs/train/11-15
'
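For reference, I also put together a simpler variant that derives the node rank and master address directly from SLURM's environment variables instead of sharing a file. I have not tested this end-to-end yet; it assumes SLURM_NODEID gives the relative node index within the allocation and that torch.distributed.run accepts a hostname (rather than an IP) for --master_addr:
srun --nodes=$SLURM_NNODES --ntasks-per-node=$SLURM_NTASKS_PER_NODE bash -c '
# SLURM_NODEID is the relative index of this node within the allocation (0..N-1)
node_rank=$SLURM_NODEID
# First hostname in the allocation acts as the rendezvous master
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
MASTER_PORT=43829
echo "Node $(hostname) has rank $node_rank, master is $MASTER_ADDR"
python -m torch.distributed.run \
    --nproc_per_node $SLURM_NTASKS_PER_NODE \
    --nnodes $SLURM_NNODES \
    --node_rank $node_rank \
    --master_addr "$MASTER_ADDR" \
    --master_port $MASTER_PORT \
    train.py --data training.yaml --weights yolov5s.pt --img 2048 --project runs/train/11-15
'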
In the original script, I allocate the resources, retrieve the master address on the master node, and share it with the secondary node through a file on the shared filesystem. Here are the outputs for each node:
Node 0:
Node d-8-3-2 has rank 0
MASTER_ADDR=172.31.130.37
MY_ADDRESS=172.31.130.37
python -m torch.distributed.run --nproc_per_node 1 --nnodes 2 --node_rank 0 --master_addr 172.31.130.37 --master_port 43829 train.py --data training.yaml --weights yolov5s.pt --img 2048 --project runs/train/11-15
Begin Training: Node 0
Node 1:
Node d-8-4-1 has rank 1
MASTER_ADDR=172.31.130.37
MY_ADDRESS=172.31.130.38
python -m torch.distributed.run --nproc_per_node 1 --nnodes 2 --node_rank 1 --master_addr 172.31.130.37 --master_port 43829 train.py --data training.yaml --weights yolov5s.pt --img 2048 --project runs/train/11-15
Begin Training: Node 1
Everything seems correct so far: the master address is shared correctly and the node ranks are displayed properly. The master node then prints the training configuration (weights, data, epochs, etc.) and loads the images:
train: weights=yolov5s.pt, cfg=, data=training.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=100, batch_size=16, imgsz=2048, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train/11-15, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: skipping check (offline), for updates see https://github.com/ultralytics/yolov5
YOLOv5 🚀 v7.0-232-g1c60c53 Python-3.9.16 torch-2.0.0+cu117 CUDA:0 (Tesla V100-PCIE-32GB, 32501MiB)
hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Comet: run 'pip install comet_ml' to automatically track and visualize YOLOv5 🚀 runs in Comet
TensorBoard: Start with 'tensorboard --logdir runs/train/11-15', view at http://localhost:6006/
Overriding model.yaml nc=80 with nc=1
from n params module arguments
0 -1 1 3520 models.common.Conv [3, 32, 6, 2, 2]
1 -1 1 18560 models.common.Conv [32, 64, 3, 2]
2 -1 1 18816 models.common.C3 [64, 64, 1]
3 -1 1 73984 models.common.Conv [64, 128, 3, 2]
4 -1 2 115712 models.common.C3 [128, 128, 2]
5 -1 1 295424 models.common.Conv [128, 256, 3, 2]
6 -1 3 625152 models.common.C3 [256, 256, 3]
7 -1 1 1180672 models.common.Conv [256, 512, 3, 2]
8 -1 1 1182720 models.common.C3 [512, 512, 1]
9 -1 1 656896 models.common.SPPF [512, 512, 5]
10 -1 1 131584 models.common.Conv [512, 256, 1, 1]
11 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
12 [-1, 6] 1 0 models.common.Concat [1]
13 -1 1 361984 models.common.C3 [512, 256, 1, False]
14 -1 1 33024 models.common.Conv [256, 128, 1, 1]
15 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
16 [-1, 4] 1 0 models.common.Concat [1]
17 -1 1 90880 models.common.C3 [256, 128, 1, False]
18 -1 1 147712 models.common.Conv [128, 128, 3, 2]
19 [-1, 14] 1 0 models.common.Concat [1]
20 -1 1 296448 models.common.C3 [256, 256, 1, False]
21 -1 1 590336 models.common.Conv [256, 256, 3, 2]
22 [-1, 10] 1 0 models.common.Concat [1]
23 -1 1 1182720 models.common.C3 [512, 512, 1, False]
24 [17, 20, 23] 1 16182 models.yolo.Detect [1, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model summary: 214 layers, 7022326 parameters, 7022326 gradients, 15.9 GFLOPs
Transferred 343/349 items from yolov5s.pt
AMP: checks passed ✅
optimizer: SGD(lr=0.01) with parameter groups 57 weight(decay=0.0), 60 weight(decay=0.0005), 60 bias
train: Scanning /data1/groups/arclab_optnav/yolov5/training/labels/train2023/trial-11-02-23/povray_images.cache... 133669 images, 44 backgrounds, 0 corrupt: 100%|██████████| 133713/133713 [00:00<?, ?it/s]
train: Scanning /data1/groups/arclab_optnav/yolov5/training/labels/train2023/trial-11-02-23/povray_images.cache... 133669 images, 44 backgrounds, 0 corrupt: 100%|██████████| 133713/133713 [00:00<?, ?it/s]
val: Scanning /data1/groups/arclab_optnav/yolov5/training/labels/validation/trial-11-02-23.cache... 32550 images, 13121 backgrounds, 0 corrupt: 100%|██████████| 45671/45671 [00:00<?, ?it/s]
AutoAnchor: 4.03 anchors/target, 0.995 Best Possible Recall (BPR). Current anchors are a good fit to dataset ✅
Plotting labels to runs/train/11-15/exp7/labels.jpg...
A quick side question: given that this is multi-GPU training and the master node is the only machine printing output, should it list all of the GPUs? It only shows one in the banner. I also noticed that the dataset scanning output for train is printed more than once. Is that expected?
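For context on the GPU question, this is the quick sanity check I have been running to see what each task can actually see (it assumes the same modules are loaded as in the training job). Since each node here has a single V100, I expect a count of 1 per node, but I would like confirmation that the rank-0 banner is only supposed to show its own local GPU:
srun --nodes=$SLURM_NNODES --ntasks-per-node=1 bash -c '
# Report which GPUs SLURM exposed to this task and what torch can see
echo "$(hostname): CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
python -c "import torch; print(torch.cuda.device_count(), torch.cuda.get_device_name(0))"
'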
Anyway, after all of that, the run fails with a timeout:
[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800081 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800081 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 95313) of binary: /state/partition1/llgrid/pkg/anaconda/anaconda3-2023a-pytorch/bin/python
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL error: remote process exited or there was a network error, NCCL version 2.14.3
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
NET/IB : Got completion from peer 172.31.130.38<42217> with error 12, opcode 32753, len 0, vendor err 129
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 95884) of binary: /state/partition1/llgrid/pkg/anaconda/anaconda3-2023a-pytorch/bin/python
Traceback (most recent call last):
File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a-pytorch/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a-pytorch/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a-pytorch/lib/python3.9/site-packages/torch/distributed/run.py", line 798, in
Traceback (most recent call last):
File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a-pytorch/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a-pytorch/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a-pytorch/lib/python3.9/site-packages/torch/distributed/run.py", line 798, in
main()
File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a-pytorch/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
main()
File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a-pytorch/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a-pytorch/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
return f(*args, **kwargs)
File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a-pytorch/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a-pytorch/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
run(args)
File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a-pytorch/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a-pytorch/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in call
elastic_launch(
File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a-pytorch/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a-pytorch/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
return launch_agent(self._config, self._entrypoint, list(args))
File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a-pytorch/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
train.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2023-11-15_23:13:49
host : d-8-3-2.supercloud.mit.edu
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 95884)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 95884
======================================================
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
train.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2023-11-15_23:09:34
host : d-8-4-1.supercloud.mit.edu
rank : 1 (local_rank: 0)
exitcode : -6 (pid: 95313)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 95313
======================================================
srun: error: d-8-4-1: task 1: Exited with exit code 1
srun: error: d-8-3-2: task 0: Exited with exit code 1
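For my next attempt I am planning to rerun with NCCL's debug output enabled and, as a test, with the InfiniBand transport disabled to see whether plain TCP sockets work. These exports would go just before the python -m torch.distributed.run call in the script above; the interface name below is a guess for our nodes and still needs checking with ip addr:
export NCCL_DEBUG=INFO            # print NCCL init and transport details from every rank
export NCCL_DEBUG_SUBSYS=INIT,NET # focus the debug output on initialization and networking
export NCCL_IB_DISABLE=1          # temporarily force NCCL off InfiniBand and onto TCP sockets
export NCCL_SOCKET_IFNAME=eno1    # guess: the Ethernet interface on our nodes; verify with ip addr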
Would somebody be able to help me figure out what the issue is? My guess is that the nodes are not communicating with each other correctly. I have been struggling with this for weeks now :( Training on a single node works with no problem, but multi-node seems to be an issue.
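In the meantime, here is the basic reachability check I am going to run between the two nodes while the job is allocated. nc may not be installed on the compute nodes, in which case I will fall back to ping only, and the port check is only meaningful while rank 0 is already listening on the rendezvous port:
srun --nodes=$SLURM_NNODES --ntasks-per-node=1 bash -c '
# Resolve the master node and check that it is reachable from this node
master=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
echo "$(hostname): testing reachability of $master"
ping -c 1 "$master"
nc -zv "$master" 43829 || echo "rendezvous port 43829 not reachable from $(hostname)"
'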