timeout #4400

Closed · alicera opened this issue Aug 12, 2021 · 11 comments · Fixed by #4422

Labels: bug (Something isn't working)

Comments

alicera commented Aug 12, 2021

When generating train.cache, the following failure occurs:

Transferred 498/506 items from yolov5m.pt
Scaled weight_decay = 0.0005625000000000001
optimizer: SGD with parameter groups 83 weight, 86 weight (no decay), 86 bias
train: Scanning '../coco/labels/train' images and labels...43718 found, 110 missing, 0 empty, 7 corrupted:  97%|██▉| 43830/45049 [01:07<00:18, 66.12it/s][E ProcessGroupNCCL.cpp:567] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=60000) ran for 67438 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:567] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=60000) ran for 67434 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:567] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=60000) ran for 67441 milliseconds before timing out.
train: Scanning '../coco/labels/train' images and labels...43731 found, 110 missing, 0 empty, 7 corrupted:  97%|██▉| 43843/45049 [01:07<00:16, 74.98it/s][E ProcessGroupNCCL.cpp:327] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=60000) ran for 67438 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:327] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=60000) ran for 67434 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:327] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=60000) ran for 67441 milliseconds before timing out.
train: Scanning '../coco/labels/train' images and labels...43944 found, 110 missing, 0 empty, 8 corrupted:  98%|██▉| 44057/45049 [01:09<00:13, 72.27it/s]ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 2869) of binary: /opt/conda/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=1
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1, 2, 3]
  role_ranks=[0, 1, 2, 3]
  global_ranks=[0, 1, 2, 3]
  role_world_sizes=[4, 4, 4, 4]
  global_world_sizes=[4, 4, 4, 4]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_555f7od4/none_6c5y8_04/attempt_1/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_555f7od4/none_6c5y8_04/attempt_1/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_555f7od4/none_6c5y8_04/attempt_1/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_555f7od4/none_6c5y8_04/attempt_1/3/error.json
Process ForkPoolWorker-2:
Traceback (most recent call last):
  File "/work/Pictures/YOLOv5/new_yolov5_0804/yolov5/utils/datasets.py", line 401, in __init__
    cache, exists = np.load(cache_path, allow_pickle=True).item(), True  # load dict
  File "/opt/conda/lib/python3.8/site-packages/numpy/lib/npyio.py", line 417, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '../coco/labels/train.cache'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 131, in worker
    put((job, i, result))
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 136, in worker
    put((job, i, (False, wrapped)))
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Process ForkPoolWorker-3:
Traceback (most recent call last):
  File "/work/Pictures/YOLOv5/new_yolov5_0804/yolov5/utils/datasets.py", line 401, in __init__
    cache, exists = np.load(cache_path, allow_pickle=True).item(), True  # load dict
  File "/opt/conda/lib/python3.8/site-packages/numpy/lib/npyio.py", line 417, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '../coco/labels/train.cache'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 131, in worker
    put((job, i, result))
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 136, in worker
    put((job, i, (False, wrapped)))
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Process ForkPoolWorker-5:
Traceback (most recent call last):
  File "/work/Pictures/YOLOv5/new_yolov5_0804/yolov5/utils/datasets.py", line 401, in __init__
    cache, exists = np.load(cache_path, allow_pickle=True).item(), True  # load dict
  File "/opt/conda/lib/python3.8/site-packages/numpy/lib/npyio.py", line 417, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '../coco/labels/train.cache'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 131, in worker
    put((job, i, result))
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 136, in worker
    put((job, i, (False, wrapped)))
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Process ForkPoolWorker-1:
Traceback (most recent call last):
  File "/work/Pictures/YOLOv5/new_yolov5_0804/yolov5/utils/datasets.py", line 401, in __init__
    cache, exists = np.load(cache_path, allow_pickle=True).item(), True  # load dict
  File "/opt/conda/lib/python3.8/site-packages/numpy/lib/npyio.py", line 417, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '../coco/labels/train.cache'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 131, in worker
    put((job, i, result))
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 136, in worker
    put((job, i, (False, wrapped)))
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Process ForkPoolWorker-6:
Traceback (most recent call last):
  File "/work/Pictures/YOLOv5/new_yolov5_0804/yolov5/utils/datasets.py", line 401, in __init__
    cache, exists = np.load(cache_path, allow_pickle=True).item(), True  # load dict
  File "/opt/conda/lib/python3.8/site-packages/numpy/lib/npyio.py", line 417, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '../coco/labels/train.cache'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 131, in worker
    put((job, i, result))
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 136, in worker
    put((job, i, (False, wrapped)))
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Process ForkPoolWorker-4:
Traceback (most recent call last):
  File "/work/Pictures/YOLOv5/new_yolov5_0804/yolov5/utils/datasets.py", line 401, in __init__
    cache, exists = np.load(cache_path, allow_pickle=True).item(), True  # load dict
  File "/opt/conda/lib/python3.8/site-packages/numpy/lib/npyio.py", line 417, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '../coco/labels/train.cache'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 131, in worker
    put((job, i, result))
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 136, in worker
    put((job, i, (False, wrapped)))
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
train: weights=yolov5m.pt, cfg=models/yolov5m.yaml, data=data/oblique.yaml, hyp=data/hyps/hyp.scratch.yaml, epochs=300, batch_size=36, imgsz=832, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, adam=False, sync_bn=False, workers=8, project=runs/train, entity=None, name=exp, exist_ok=False, quad=False, linear_lr=False, label_smoothing=0.0, upload_dataset=False, bbox_interval=-1, save_period=-1, artifact_alias=latest, local_rank=0, freeze=0
github: skipping check (Docker image), for updates see https://github.com/ultralytics/yolov5
YOLOv5 🚀 v5.0-346-g771ac6c torch 1.10.0a0+ecc3718 CUDA:0 (NVIDIA GeForce GTX 1080 Ti, 11178.5MB)

Added key: store_based_barrier_key:1 to store for rank: 0
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=4, worker_count=8, timeout=0:01:00)
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=4, worker_count=8, timeout=0:01:00)
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=4, worker_count=8, timeout=0:01:00)
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=4, worker_count=8, timeout=0:01:00)
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=4, worker_count=8, timeout=0:01:00)
Traceback (most recent call last):
  File "train.py", line 596, in <module>
Traceback (most recent call last):
Traceback (most recent call last):
  File "train.py", line 596, in <module>
Traceback (most recent call last):
  File "train.py", line 596, in <module>
  File "train.py", line 596, in <module>
    main(opt)
  File "train.py", line 491, in main
        main(opt)    main(opt)
main(opt)

  File "train.py", line 491, in main
  File "train.py", line 491, in main
  File "train.py", line 491, in main
    dist.init_process_group(backend="nccl" if dist.is_nccl_available() else "gloo", timeout=timedelta(seconds=60))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 591, in init_process_group
            dist.init_process_group(backend="nccl" if dist.is_nccl_available() else "gloo", timeout=timedelta(seconds=60))dist.init_process_group(backend="nccl" if dist.is_nccl_available() else "gloo", timeout=timedelta(seconds=60))dist.init_process_group(backend="nccl" if dist.is_nccl_available() else "gloo", timeout=timedelta(seconds=60))


  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 591, in init_process_group
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 591, in init_process_group
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 591, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 241, in _store_based_barrier
            _store_based_barrier(rank, store, timeout)_store_based_barrier(rank, store, timeout)_store_based_barrier(rank, store, timeout)


  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 241, in _store_based_barrier
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 241, in _store_based_barrier
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 241, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 2, for key: store_based_barrier_key:1 (world_size=4, worker_count=8, timeout=0:01:00)
        raise RuntimeError(    raise RuntimeError(
raise RuntimeError(

RuntimeErrorRuntimeErrorRuntimeError: : Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=4, worker_count=8, timeout=0:01:00): Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=4, worker_count=8, timeout=0:01:00)
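
For context on the failure mode: rank 0 spends over a minute scanning the labels and writing train.cache while the other ranks wait on a collective, and because the process group here was created with a 60-second timeout (see timeout=timedelta(seconds=60) in the traceback), the NCCL watchdog aborts the job once that limit is exceeded. The sketch below reproduces the mechanism in isolation; scan_dataset() is a hypothetical stand-in for the cache scan, not YOLOv5 code, and the script would be launched with e.g. python -m torch.distributed.launch --use_env --nproc_per_node 2 timeout_sketch.py (filename made up for the example).

import os
import time
from datetime import timedelta

import torch
import torch.distributed as dist


def scan_dataset():
    # Hypothetical stand-in for the train.cache scan; on a large dataset this
    # step can run for several minutes on rank 0 while the other ranks idle.
    time.sleep(120)


def main():
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    # Same 60 s timeout that appears in the tracebacks above; any collective
    # waiting longer than this is aborted by the NCCL watchdog.
    dist.init_process_group(backend="nccl", timeout=timedelta(seconds=60))
    if local_rank == 0:
        scan_dataset()  # rank 0 builds the cache
    dist.barrier()      # other ranks block here; after ~60 s the watchdog fires
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
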
alicera added the bug label on Aug 12, 2021
glenn-jocher (Member) commented Aug 12, 2021

@alicera 👋 hi, thanks for letting us know about this problem with YOLOv5 🚀. We've created a few short guidelines below to help users provide what we need in order to get started investigating a possible problem.

How to create a Minimal, Reproducible Example

When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be:

  • Minimal – Use as little code as possible that still produces the same problem
  • Complete – Provide all parts someone else needs to reproduce your problem in the question itself
  • Reproducible – Test the code you're about to provide to make sure it reproduces the problem

In addition to the above requirements, for Ultralytics to provide assistance your code should be:

  • Current – Verify that your code is up-to-date with current GitHub master, and if necessary git pull or git clone a new copy to ensure your problem has not already been resolved by previous commits.
  • Unmodified – Your problem must be reproducible without any modifications to the codebase in this repository. Ultralytics does not provide support for custom code ⚠️.

If you believe your problem meets all of the above criteria, please close this issue and raise a new one using the 🐛 Bug Report template and providing a minimum reproducible example to help us better understand and diagnose your problem.

Thank you! 😃

alicera (Author) commented Aug 14, 2021

PyTorch 21.07 (Docker image)
Driver Version: 470.57.02
GeForce GTX 1070

Command (8/14):

git clone https://github.com/ultralytics/yolov5
python -m torch.distributed.launch --nproc_per_node 3 train.py --batch-size 36 --data data/fine.yaml --img-size 832 --cfg models/yolov5m.yaml --weights yolov5m.pt --hyp data/hyps/hyp.finetune.yaml 

glenn-jocher (Member) commented:

@alicera you should run Multi-GPU training with 2, 4, or 8 GPUs; using an odd number of processes is not recommended.

iceisfun commented Aug 14, 2021

Just tested latest code on 3x RTX Titan and it works great

python -m torch.distributed.launch --nproc_per_node 3 train.py --batch-size 9

glenn-jocher (Member) commented:

@iceisfun thanks for the feedback!

alicera (Author) commented Aug 14, 2021

I tried 2 GPUs:

Starting training for 300 epochs...

     Epoch   gpu_mem       box       obj       cls    labels  img_size
     0/299     6.52G   0.06442   0.02946   0.01389       259       832:   1%|▋                                                          | 1/89 [00:08<12:49,  8.74s/it]Reducer buckets have been rebuilt in this iteration.
     0/299     6.43G    0.0625   0.02412   0.01394        33       832: 100%|██████████████████████████████████████████████████████████| 89/89 [01:31<00:00,  1.03s/it]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95:  96%|████████████████████████████████████████  | 85/89 [01:07<00:05,  1.42s/it][E ProcessGroupNCCL.cpp:567] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=60000) ran for 69479 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:327] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=60000) ran for 69479 milliseconds before timing out.
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95:  97%|████████████████████████████████████████▌ | 86/89 [01:10<00:05,  1.94s/it]ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 22929) of binary: /opt/conda/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=1
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1]
  role_ranks=[0, 1]
  global_ranks=[0, 1]
  role_world_sizes=[2, 2]
  global_world_sizes=[2, 2]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_u5rgk6zg/none_fly5kges/attempt_1/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_u5rgk6zg/none_fly5kges/attempt_1/1/error.json
train: weights=yolov5m.pt, cfg=models/yolov5m.yaml, data=data/fine.yaml, hyp=data/hyps/hyp.finetune.yaml, epochs=300, batch_size=24, imgsz=832, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, adam=False, sync_bn=False, workers=8, project=runs/train, entity=None, name=exp, exist_ok=False, quad=False, linear_lr=False, label_smoothing=0.0, upload_dataset=False, bbox_interval=-1, save_period=-1, artifact_alias=latest, local_rank=0, freeze=0
github: skipping check (Docker image), for updates see https://github.com/ultralytics/yolov5
YOLOv5 🚀 v5.0-363-g63e09fd torch 1.10.0a0+ecc3718 CUDA:0 (NVIDIA GeForce GTX 1070, 8119.5625MB)

Added key: store_based_barrier_key:1 to store for rank: 0
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:01:00)
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:01:00)
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:01:00)
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:01:00)
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:01:00)
Traceback (most recent call last):
  File "train.py", line 600, in <module>
    main(opt)
  File "train.py", line 494, in main
    dist.init_process_group(backend="nccl" if dist.is_nccl_available() else "gloo", timeout=timedelta(seconds=60))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 591, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 241, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:01:00)
Traceback (most recent call last):
  File "train.py", line 600, in <module>
    main(opt)
  File "train.py", line 494, in main
    dist.init_process_group(backend="nccl" if dist.is_nccl_available() else "gloo", timeout=timedelta(seconds=60))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 591, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 241, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:01:00)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 23531) of binary: /opt/conda/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 2/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=2
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1]
  role_ranks=[0, 1]
  global_ranks=[0, 1]
  role_world_sizes=[2, 2]
  global_world_sizes=[2, 2]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_u5rgk6zg/none_fly5kges/attempt_2/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_u5rgk6zg/none_fly5kges/attempt_2/1/error.json
train: weights=yolov5m.pt, cfg=models/yolov5m.yaml, data=data/fine.yaml, hyp=data/hyps/hyp.finetune.yaml, epochs=300, batch_size=24, imgsz=832, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, adam=False, sync_bn=False, workers=8, project=runs/train, entity=None, name=exp, exist_ok=False, quad=False, linear_lr=False, label_smoothing=0.0, upload_dataset=False, bbox_interval=-1, save_period=-1, artifact_alias=latest, local_rank=0, freeze=0
github: skipping check (Docker image), for updates see https://github.com/ultralytics/yolov5
YOLOv5 🚀 v5.0-363-g63e09fd torch 1.10.0a0+ecc3718 CUDA:0 (NVIDIA GeForce GTX 1070, 8119.5625MB)

Added key: store_based_barrier_key:1 to store for rank: 0
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=6, timeout=0:01:00)
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=6, timeout=0:01:00)
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=6, timeout=0:01:00)
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=6, timeout=0:01:00)
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=6, timeout=0:01:00)
Traceback (most recent call last):
  File "train.py", line 600, in <module>
Traceback (most recent call last):
  File "train.py", line 600, in <module>
    main(opt)    
main(opt)
  File "train.py", line 494, in main
  File "train.py", line 494, in main

alicera (Author) commented Aug 14, 2021

Do you know why an odd number of processes is not recommended?

iceisfun commented:

@alicera Have you tried training on each device individually as a single GPU to ensure they are all working?

Also, checking the nvidia-smi and dmesg output can reveal hardware issues.
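
For example, a quick per-device smoke test (an illustrative sketch, not from this thread) runs a small CUDA op on every visible GPU so that a failing card shows up before DDP training starts:

import torch

for i in range(torch.cuda.device_count()):
    name = torch.cuda.get_device_name(i)
    try:
        x = torch.randn(1024, 1024, device=f"cuda:{i}")
        checksum = (x @ x).sum()   # force real work onto the device
        torch.cuda.synchronize(i)
        print(f"cuda:{i} ({name}) OK, checksum={checksum.item():.3f}")
    except RuntimeError as e:
        print(f"cuda:{i} ({name}) FAILED: {e}")
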

alicera (Author) commented Aug 15, 2021

I found that the main problem is using an odd number of GPUs together with --hyp data/hyps/hyp.finetune.yaml; that combination fails easily.

With an even number of GPUs and --hyp data/hyps/hyp.scratch.yaml, it works until epoch 156 and then fails at epoch 156.

iceisfun commented:

I've tested --hyp data/hyps/hyp.scratch.yaml and --hyp data/hyps/hyp.finetune.yaml with 3x RTX Titan and both work fine.

Could you test each GPU individually to make sure you can train the model on every single device?

glenn-jocher linked a pull request on Aug 15, 2021 that will close this issue
glenn-jocher (Member) commented Aug 15, 2021

@alicera @iceisfun good news 😃! Your original issue may now be fixed ✅ in PR #4422.

This PR updates the DDP process group and was verified over 3 epochs of COCO training with 4x A100 DDP NCCL on an EC2 P4d instance, using the official Docker image and a CUDA 11.1 pip install from https://pytorch.org/get-started/locally/

d=yolov5 && git clone https://github.com/ultralytics/yolov5 -b master $d && cd $d

python -m torch.distributed.launch --nproc_per_node 4 --master_port 1 train.py --data coco.yaml --batch 64 --weights '' --project study --cfg yolov5l.yaml --epochs 300 --name yolov5l-1280 --img 1280 --linear --device 0,1,2,3
python -m torch.distributed.launch --nproc_per_node 4 --master_port 2 train.py --data coco.yaml --batch 64 --weights '' --project study --cfg yolov5l.yaml --epochs 300 --name yolov5l-1280 --img 1280 --linear --device 4,5,6,7

To receive this update:

  • Git – git pull from within your yolov5/ directory or git clone https://github.com/ultralytics/yolov5 again
  • PyTorch Hub – Force-reload with model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True)
  • Notebooks – View the updated notebooks on Colab or Kaggle
  • Docker – sudo docker pull ultralytics/yolov5:latest to update your image

Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!
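
For anyone on an older checkout who cannot update yet, one general remedy for this class of failure is to give the DDP process group a timeout long enough to cover slow one-off steps such as the initial dataset cache scan, rather than the 60 seconds seen in the tracebacks above. The snippet below is only a hedged illustration of that idea, not the literal change in PR #4422; it assumes the usual torch.distributed.launch / torchrun environment variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) are already set.

from datetime import timedelta

import torch.distributed as dist

dist.init_process_group(
    backend="nccl" if dist.is_nccl_available() else "gloo",
    timeout=timedelta(minutes=30),  # generous ceiling for one-off setup work like caching
)
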
