timeout #4400

Closed · alicera opened this issue Aug 12, 2021 · 11 comments · Fixed by #4422

Labels: bug (Something isn't working)

Comments

alicera commented Aug 12, 2021

When generating train.cache, the following failure occurs:

Transferred 498/506 items from yolov5m.pt
Scaled weight_decay = 0.0005625000000000001
optimizer: SGD with parameter groups 83 weight, 86 weight (no decay), 86 bias
train: Scanning '../coco/labels/train' images and labels...43718 found, 110 missing, 0 empty, 7 corrupted:  97%|██▉| 43830/45049 [01:07<00:18, 66.12it/s][E ProcessGroupNCCL.cpp:567] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=60000) ran for 67438 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:567] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=60000) ran for 67434 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:567] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=60000) ran for 67441 milliseconds before timing out.
train: Scanning '../coco/labels/train' images and labels...43731 found, 110 missing, 0 empty, 7 corrupted:  97%|██▉| 43843/45049 [01:07<00:16, 74.98it/s][E ProcessGroupNCCL.cpp:327] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=60000) ran for 67438 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:327] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=60000) ran for 67434 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:327] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=60000) ran for 67441 milliseconds before timing out.
train: Scanning '../coco/labels/train' images and labels...43944 found, 110 missing, 0 empty, 8 corrupted:  98%|██▉| 44057/45049 [01:09<00:13, 72.27it/s]ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 2869) of binary: /opt/conda/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=1
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1, 2, 3]
  role_ranks=[0, 1, 2, 3]
  global_ranks=[0, 1, 2, 3]
  role_world_sizes=[4, 4, 4, 4]
  global_world_sizes=[4, 4, 4, 4]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_555f7od4/none_6c5y8_04/attempt_1/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_555f7od4/none_6c5y8_04/attempt_1/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_555f7od4/none_6c5y8_04/attempt_1/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_555f7od4/none_6c5y8_04/attempt_1/3/error.json
Process ForkPoolWorker-2:
Traceback (most recent call last):
  File "/work/Pictures/YOLOv5/new_yolov5_0804/yolov5/utils/datasets.py", line 401, in __init__
    cache, exists = np.load(cache_path, allow_pickle=True).item(), True  # load dict
  File "/opt/conda/lib/python3.8/site-packages/numpy/lib/npyio.py", line 417, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '../coco/labels/train.cache'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 131, in worker
    put((job, i, result))
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 136, in worker
    put((job, i, (False, wrapped)))
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Process ForkPoolWorker-3:
Traceback (most recent call last):
  File "/work/Pictures/YOLOv5/new_yolov5_0804/yolov5/utils/datasets.py", line 401, in __init__
    cache, exists = np.load(cache_path, allow_pickle=True).item(), True  # load dict
  File "/opt/conda/lib/python3.8/site-packages/numpy/lib/npyio.py", line 417, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '../coco/labels/train.cache'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 131, in worker
    put((job, i, result))
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 136, in worker
    put((job, i, (False, wrapped)))
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Process ForkPoolWorker-5:
Traceback (most recent call last):
  File "/work/Pictures/YOLOv5/new_yolov5_0804/yolov5/utils/datasets.py", line 401, in __init__
    cache, exists = np.load(cache_path, allow_pickle=True).item(), True  # load dict
  File "/opt/conda/lib/python3.8/site-packages/numpy/lib/npyio.py", line 417, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '../coco/labels/train.cache'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 131, in worker
    put((job, i, result))
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 136, in worker
    put((job, i, (False, wrapped)))
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Process ForkPoolWorker-1:
Traceback (most recent call last):
  File "/work/Pictures/YOLOv5/new_yolov5_0804/yolov5/utils/datasets.py", line 401, in __init__
    cache, exists = np.load(cache_path, allow_pickle=True).item(), True  # load dict
  File "/opt/conda/lib/python3.8/site-packages/numpy/lib/npyio.py", line 417, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '../coco/labels/train.cache'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 131, in worker
    put((job, i, result))
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 136, in worker
    put((job, i, (False, wrapped)))
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Process ForkPoolWorker-6:
Traceback (most recent call last):
  File "/work/Pictures/YOLOv5/new_yolov5_0804/yolov5/utils/datasets.py", line 401, in __init__
    cache, exists = np.load(cache_path, allow_pickle=True).item(), True  # load dict
  File "/opt/conda/lib/python3.8/site-packages/numpy/lib/npyio.py", line 417, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '../coco/labels/train.cache'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 131, in worker
    put((job, i, result))
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 136, in worker
    put((job, i, (False, wrapped)))
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Process ForkPoolWorker-4:
Traceback (most recent call last):
  File "/work/Pictures/YOLOv5/new_yolov5_0804/yolov5/utils/datasets.py", line 401, in __init__
    cache, exists = np.load(cache_path, allow_pickle=True).item(), True  # load dict
  File "/opt/conda/lib/python3.8/site-packages/numpy/lib/npyio.py", line 417, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '../coco/labels/train.cache'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 131, in worker
    put((job, i, result))
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 136, in worker
    put((job, i, (False, wrapped)))
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
train: weights=yolov5m.pt, cfg=models/yolov5m.yaml, data=data/oblique.yaml, hyp=data/hyps/hyp.scratch.yaml, epochs=300, batch_size=36, imgsz=832, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, adam=False, sync_bn=False, workers=8, project=runs/train, entity=None, name=exp, exist_ok=False, quad=False, linear_lr=False, label_smoothing=0.0, upload_dataset=False, bbox_interval=-1, save_period=-1, artifact_alias=latest, local_rank=0, freeze=0
github: skipping check (Docker image), for updates see https://github.com/ultralytics/yolov5
YOLOv5 🚀 v5.0-346-g771ac6c torch 1.10.0a0+ecc3718 CUDA:0 (NVIDIA GeForce GTX 1080 Ti, 11178.5MB)

Added key: store_based_barrier_key:1 to store for rank: 0
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=4, worker_count=8, timeout=0:01:00)
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=4, worker_count=8, timeout=0:01:00)
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=4, worker_count=8, timeout=0:01:00)
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=4, worker_count=8, timeout=0:01:00)
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=4, worker_count=8, timeout=0:01:00)
Traceback (most recent call last):
  File "train.py", line 596, in <module>
Traceback (most recent call last):
Traceback (most recent call last):
  File "train.py", line 596, in <module>
Traceback (most recent call last):
  File "train.py", line 596, in <module>
  File "train.py", line 596, in <module>
    main(opt)
  File "train.py", line 491, in main
        main(opt)    main(opt)
main(opt)

  File "train.py", line 491, in main
  File "train.py", line 491, in main
  File "train.py", line 491, in main
    dist.init_process_group(backend="nccl" if dist.is_nccl_available() else "gloo", timeout=timedelta(seconds=60))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 591, in init_process_group
            dist.init_process_group(backend="nccl" if dist.is_nccl_available() else "gloo", timeout=timedelta(seconds=60))dist.init_process_group(backend="nccl" if dist.is_nccl_available() else "gloo", timeout=timedelta(seconds=60))dist.init_process_group(backend="nccl" if dist.is_nccl_available() else "gloo", timeout=timedelta(seconds=60))


  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 591, in init_process_group
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 591, in init_process_group
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 591, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 241, in _store_based_barrier
            _store_based_barrier(rank, store, timeout)_store_based_barrier(rank, store, timeout)_store_based_barrier(rank, store, timeout)


  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 241, in _store_based_barrier
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 241, in _store_based_barrier
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 241, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 2, for key: store_based_barrier_key:1 (world_size=4, worker_count=8, timeout=0:01:00)
        raise RuntimeError(    raise RuntimeError(
raise RuntimeError(

RuntimeErrorRuntimeErrorRuntimeError: : Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=4, worker_count=8, timeout=0:01:00): Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=4, worker_count=8, timeout=0:01:00)
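
For context on the failure mode: rank 0 spends over a minute scanning the labels and writing train.cache while the other ranks wait on a collective, and because the process group here was created with a 60-second timeout (see timeout=timedelta(seconds=60) in the traceback), the NCCL watchdog aborts the job once that limit is exceeded. The sketch below reproduces the mechanism in isolation; scan_dataset() is a hypothetical stand-in for the cache scan, not YOLOv5 code, and the script would be launched with e.g. python -m torch.distributed.launch --use_env --nproc_per_node 2 timeout_sketch.py (filename made up for the example).

import os
import time
from datetime import timedelta

import torch
import torch.distributed as dist


def scan_dataset():
    # Hypothetical stand-in for the train.cache scan; on a large dataset this
    # step can run for several minutes on rank 0 while the other ranks idle.
    time.sleep(120)


def main():
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    # Same 60 s timeout that appears in the tracebacks above; any collective
    # waiting longer than this is aborted by the NCCL watchdog.
    dist.init_process_group(backend="nccl", timeout=timedelta(seconds=60))
    if local_rank == 0:
        scan_dataset()  # rank 0 builds the cache
    dist.barrier()      # other ranks block here; after ~60 s the watchdog fires
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
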
alicera added the bug label on Aug 12, 2021
glenn-jocher (Member) commented Aug 12, 2021

@alicera 👋 hi, thanks for letting us know about this problem with YOLOv5 🚀. We've created a few short guidelines below to help users provide what we need in order to get started investigating a possible problem.

How to create a Minimal, Reproducible Example

When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be:

  • Minimal – Use as little code as possible that still produces the same problem
  • Complete – Provide all parts someone else needs to reproduce your problem in the question itself
  • Reproducible – Test the code you're about to provide to make sure it reproduces the problem

In addition to the above requirements, for Ultralytics to provide assistance your code should be:

  • Current – Verify that your code is up-to-date with current GitHub master, and if necessary git pull or git clone a new copy to ensure your problem has not already been resolved by previous commits.
  • Unmodified – Your problem must be reproducible without any modifications to the codebase in this repository. Ultralytics does not provide support for custom code ⚠️.

If you believe your problem meets all of the above criteria, please close this issue and raise a new one using the 🐛 Bug Report template and providing a minimum reproducible example to help us better understand and diagnose your problem.

Thank you! 😃

alicera (Author) commented Aug 14, 2021

PyTorch 21.07 (Docker image)
Driver Version: 470.57.02
GeForce GTX 1070

Command (8/14):

git clone https://github.com/ultralytics/yolov5
python -m torch.distributed.launch --nproc_per_node 3 train.py --batch-size 36 --data data/fine.yaml --img-size 832 --cfg models/yolov5m.yaml --weights yolov5m.pt --hyp data/hyps/hyp.finetune.yaml 

glenn-jocher (Member) commented:

@alicera you should run Multi-GPU training with 2, 4, or 8 GPUs; using an odd number of processes is not recommended.

iceisfun commented Aug 14, 2021

Just tested latest code on 3x RTX Titan and it works great

python -m torch.distributed.launch --nproc_per_node 3 train.py --batch-size 9

glenn-jocher (Member) commented:

@iceisfun thanks for the feedback!

alicera (Author) commented Aug 14, 2021

I tried 2 GPUs:

Starting training for 300 epochs...

     Epoch   gpu_mem       box       obj       cls    labels  img_size
     0/299     6.52G   0.06442   0.02946   0.01389       259       832:   1%|▋                                                          | 1/89 [00:08<12:49,  8.74s/it]Reducer buckets have been rebuilt in this iteration.
     0/299     6.43G    0.0625   0.02412   0.01394        33       832: 100%|██████████████████████████████████████████████████████████| 89/89 [01:31<00:00,  1.03s/it]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95:  96%|████████████████████████████████████████  | 85/89 [01:07<00:05,  1.42s/it][E ProcessGroupNCCL.cpp:567] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=60000) ran for 69479 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:327] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=60000) ran for 69479 milliseconds before timing out.
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95:  97%|████████████████████████████████████████▌ | 86/89 [01:10<00:05,  1.94s/it]ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 22929) of binary: /opt/conda/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=1
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1]
  role_ranks=[0, 1]
  global_ranks=[0, 1]
  role_world_sizes=[2, 2]
  global_world_sizes=[2, 2]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_u5rgk6zg/none_fly5kges/attempt_1/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_u5rgk6zg/none_fly5kges/attempt_1/1/error.json
train: weights=yolov5m.pt, cfg=models/yolov5m.yaml, data=data/fine.yaml, hyp=data/hyps/hyp.finetune.yaml, epochs=300, batch_size=24, imgsz=832, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, adam=False, sync_bn=False, workers=8, project=runs/train, entity=None, name=exp, exist_ok=False, quad=False, linear_lr=False, label_smoothing=0.0, upload_dataset=False, bbox_interval=-1, save_period=-1, artifact_alias=latest, local_rank=0, freeze=0
github: skipping check (Docker image), for updates see https://github.com/ultralytics/yolov5
YOLOv5 🚀 v5.0-363-g63e09fd torch 1.10.0a0+ecc3718 CUDA:0 (NVIDIA GeForce GTX 1070, 8119.5625MB)

Added key: store_based_barrier_key:1 to store for rank: 0
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:01:00)
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:01:00)
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:01:00)
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:01:00)
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:01:00)
Traceback (most recent call last):
  File "train.py", line 600, in <module>
    main(opt)
  File "train.py", line 494, in main
    dist.init_process_group(backend="nccl" if dist.is_nccl_available() else "gloo", timeout=timedelta(seconds=60))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 591, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 241, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:01:00)
Traceback (most recent call last):
  File "train.py", line 600, in <module>
    main(opt)
  File "train.py", line 494, in main
    dist.init_process_group(backend="nccl" if dist.is_nccl_available() else "gloo", timeout=timedelta(seconds=60))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 591, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 241, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:01:00)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 23531) of binary: /opt/conda/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 2/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=2
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1]
  role_ranks=[0, 1]
  global_ranks=[0, 1]
  role_world_sizes=[2, 2]
  global_world_sizes=[2, 2]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_u5rgk6zg/none_fly5kges/attempt_2/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_u5rgk6zg/none_fly5kges/attempt_2/1/error.json
train: weights=yolov5m.pt, cfg=models/yolov5m.yaml, data=data/fine.yaml, hyp=data/hyps/hyp.finetune.yaml, epochs=300, batch_size=24, imgsz=832, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, adam=False, sync_bn=False, workers=8, project=runs/train, entity=None, name=exp, exist_ok=False, quad=False, linear_lr=False, label_smoothing=0.0, upload_dataset=False, bbox_interval=-1, save_period=-1, artifact_alias=latest, local_rank=0, freeze=0
github: skipping check (Docker image), for updates see https://github.com/ultralytics/yolov5
YOLOv5 🚀 v5.0-363-g63e09fd torch 1.10.0a0+ecc3718 CUDA:0 (NVIDIA GeForce GTX 1070, 8119.5625MB)

Added key: store_based_barrier_key:1 to store for rank: 0
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=6, timeout=0:01:00)
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=6, timeout=0:01:00)
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=6, timeout=0:01:00)
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=6, timeout=0:01:00)
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=6, timeout=0:01:00)
Traceback (most recent call last):
  File "train.py", line 600, in <module>
Traceback (most recent call last):
  File "train.py", line 600, in <module>
    main(opt)    
main(opt)
  File "train.py", line 494, in main
  File "train.py", line 494, in main

alicera (Author) commented Aug 14, 2021

Do you know why an odd number of processes is not recommended?

iceisfun commented:

@alicera Have you tried training on each device individually as a single GPU to ensure they are all working?

Also, checking the nvidia-smi and dmesg output can reveal hardware issues.
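
For example, a quick per-device smoke test (an illustrative sketch, not from this thread) runs a small CUDA op on every visible GPU so that a failing card shows up before DDP training starts:

import torch

for i in range(torch.cuda.device_count()):
    name = torch.cuda.get_device_name(i)
    try:
        x = torch.randn(1024, 1024, device=f"cuda:{i}")
        checksum = (x @ x).sum()   # force real work onto the device
        torch.cuda.synchronize(i)
        print(f"cuda:{i} ({name}) OK, checksum={checksum.item():.3f}")
    except RuntimeError as e:
        print(f"cuda:{i} ({name}) FAILED: {e}")
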

alicera (Author) commented Aug 15, 2021

I found that the main problem is using an odd number of GPUs together with --hyp data/hyps/hyp.finetune.yaml; that combination fails easily.

With an even number of GPUs and --hyp data/hyps/hyp.scratch.yaml, it works until epoch 156 and then fails at epoch 156.

iceisfun commented:

I've tested --hyp data/hyps/hyp.scratch.yaml and --hyp data/hyps/hyp.finetune.yaml with 3x RTX Titan and both work fine.

Could you test each GPU individually to make sure you can train the model on every single device?

glenn-jocher linked a pull request on Aug 15, 2021 that will close this issue
glenn-jocher (Member) commented Aug 15, 2021

@alicera @iceisfun good news 😃! Your original issue may now be fixed ✅ in PR #4422.

This PR updates the DDP process group and was verified over 3 epochs of COCO training with 4x A100 DDP NCCL on an EC2 P4d instance, using the official Docker image and a CUDA 11.1 pip install from https://pytorch.org/get-started/locally/

d=yolov5 && git clone https://github.com/ultralytics/yolov5 -b master $d && cd $d

python -m torch.distributed.launch --nproc_per_node 4 --master_port 1 train.py --data coco.yaml --batch 64 --weights '' --project study --cfg yolov5l.yaml --epochs 300 --name yolov5l-1280 --img 1280 --linear --device 0,1,2,3
python -m torch.distributed.launch --nproc_per_node 4 --master_port 2 train.py --data coco.yaml --batch 64 --weights '' --project study --cfg yolov5l.yaml --epochs 300 --name yolov5l-1280 --img 1280 --linear --device 4,5,6,7

To receive this update:

  • Git – git pull from within your yolov5/ directory or git clone https://github.com/ultralytics/yolov5 again
  • PyTorch Hub – Force-reload with model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True)
  • Notebooks – View the updated notebooks on Colab or Kaggle
  • Docker – sudo docker pull ultralytics/yolov5:latest to update your image

Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!
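
For anyone on an older checkout who cannot update yet, one general remedy for this class of failure is to give the DDP process group a timeout long enough to cover slow one-off steps such as the initial dataset cache scan, rather than the 60 seconds seen in the tracebacks above. The snippet below is only a hedged illustration of that idea, not the literal change in PR #4422; it assumes the usual torch.distributed.launch / torchrun environment variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) are already set.

from datetime import timedelta

import torch.distributed as dist

dist.init_process_group(
    backend="nccl" if dist.is_nccl_available() else "gloo",
    timeout=timedelta(minutes=30),  # generous ceiling for one-off setup work like caching
)
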
