timeout #4400
@alicera 👋 hi, thanks for letting us know about this problem with YOLOv5 🚀. We've created a few short guidelines below to help users provide what we need in order to get started investigating a possible problem.

How to create a Minimal, Reproducible Example

When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be:
In addition to the above requirements, for Ultralytics to provide assistance your code should be:
If you believe your problem meets all of the above criteria, please close this issue and raise a new one using the 🐛 Bug Report template, providing a minimum reproducible example to help us better understand and diagnose your problem. Thank you! 😃
PyTorch 21.07, command:
@alicera you should use multi-GPU in counts of 2, 4, or 8; it's not recommended to use an odd number of processes.
Just tested the latest code on 3x RTX Titan and it works great.
@iceisfun thanks for the feedback!
I tried with 2 GPUs.
Do you know the reason it's not recommended to use an odd number of processes?
@alicera Have you tried training on each device as a single GPU to ensure they are all working? Checking the nvidia-smi and dmesg output can also reveal hardware issues. A quick sanity check is sketched below.
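For reference, here is a minimal, hypothetical sanity-check script (not part of YOLOv5) that runs a small computation on every visible CUDA device; a faulty or misconfigured device will usually fail here before you invest time in a full training run:

```python
# Hypothetical per-GPU sanity check (illustration only, not part of YOLOv5).
# It confirms that every visible CUDA device can allocate memory and run a small matmul.
import torch

def check_gpus():
    n = torch.cuda.device_count()
    print(f"Found {n} CUDA device(s)")
    for i in range(n):
        name = torch.cuda.get_device_name(i)
        try:
            x = torch.randn(1024, 1024, device=f"cuda:{i}")
            y = x @ x  # exercise the device with a small matmul
            torch.cuda.synchronize(i)
            print(f"cuda:{i} ({name}) OK, norm={y.norm().item():.1f}")
        except RuntimeError as e:
            print(f"cuda:{i} ({name}) FAILED: {e}")

if __name__ == "__main__":
    check_gpus()
```

If any single device fails here or in a single-GPU training run, the DDP failure is more likely a hardware or driver problem than a YOLOv5 one.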
I found the main problem: when I use an odd number of GPUs with --hyp data/hyps/hyp.finetune.yaml, training fails easily. When I use an even number of GPUs with --hyp data/hyps/hyp.scratch.yaml, it works until epoch 156.
I've tested --hyp data/hyps/hyp.scratch.yaml and --hyp data/hyps/hyp.finetune.yaml with 3x RTX Titan and it works fine. Could you test each GPU individually to make sure you can train the model on every single device?
@alicera @iceisfun good news 😃! Your original issue may now be fixed ✅ in PR #4422. This PR updates the DDP process group, and was verified over 3 epochs of COCO training with 4x A100 DDP NCCL on an EC2 P4d instance with the official Docker image and a CUDA 11.1 pip install from https://pytorch.org/get-started/locally/

d=yolov5 && git clone https://github.com/ultralytics/yolov5 -b master $d && cd $d
python -m torch.distributed.launch --nproc_per_node 4 --master_port 1 train.py --data coco.yaml --batch 64 --weights '' --project study --cfg yolov5l.yaml --epochs 300 --name yolov5l-1280 --img 1280 --linear --device 0,1,2,3
python -m torch.distributed.launch --nproc_per_node 4 --master_port 2 train.py --data coco.yaml --batch 64 --weights '' --project study --cfg yolov5l.yaml --epochs 300 --name yolov5l-1280 --img 1280 --linear --device 4,5,6,7

To receive this update:
Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!
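For anyone hitting DDP timeouts while one rank is busy (for example during dataset caching), here is a minimal sketch of the general PyTorch mechanism involved: the process group accepts a timeout argument that can be raised above the library default. This is an illustration only and not necessarily what PR #4422 changes; the helper name init_ddp and the 60-minute value are assumptions for the example.

```python
# Minimal sketch of raising the DDP process-group timeout (illustration only;
# not necessarily the change made in PR #4422).
import os
from datetime import timedelta

import torch
import torch.distributed as dist

def init_ddp(timeout_minutes=60):  # hypothetical helper; 60 min is an arbitrary example
    # Assumes the launcher (e.g. torchrun) sets LOCAL_RANK, RANK, WORLD_SIZE,
    # MASTER_ADDR and MASTER_PORT in the environment.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    dist.init_process_group(
        backend="nccl" if dist.is_nccl_available() else "gloo",
        timeout=timedelta(minutes=timeout_minutes),  # above the library default (commonly 30 min)
    )
    return local_rank
```

If the timeout really does coincide with train.cache generation, it might also be worth generating the cache once in a single-GPU run before launching DDP, so the other ranks don't sit at a barrier while the dataset is scanned.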
The problem happens while train.cache is being generated.