
nan loss while training #19

Open
Park-ing-lot opened this issue Aug 12, 2021 · 1 comment

Comments

@Park-ing-lot

I followed your instructions with the COCO 2017 train and val data, using this command:

CUDA_VISIBLE_DEVICES=2 python tools/train_net.py --config-file "configs/e2e_mask_rcnn_R_101_FPN_1x.yaml" --skip-test SOLVER.IMS_PER_BATCH 2 SOLVER.BASE_LR 0.0025 SOLVER.MAX_ITER 720000 SOLVER.STEPS "(480000, 640000)" TEST.IMS_PER_BATCH 1 MODEL.RPN.FPN_POST_NMS_TOP_N_TRAIN 2000

During training, the loss stays around 8 and does not drop; after about 6,000 steps, the model starts producing NaN losses.

Do you have any idea why the loss becomes NaN? What could the problem be?
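
For reference, a common first step when debugging this is to check each loss term for NaN/Inf before backpropagation and to clip gradients. Below is a minimal PyTorch sketch of that check against a generic training loop; `model`, `optimizer`, `data_loader`, and the loss-dict keys are placeholders, not the actual objects built by tools/train_net.py.

```python
# Minimal sketch, assuming a maskrcnn-benchmark-style model that returns a dict of losses.
# `model`, `optimizer`, and `data_loader` are hypothetical placeholders.
import math
import torch

def train_one_epoch(model, optimizer, data_loader, device, max_norm=5.0):
    model.train()
    for iteration, (images, targets) in enumerate(data_loader):
        images = images.to(device)
        targets = [t.to(device) for t in targets]

        loss_dict = model(images, targets)  # e.g. {'loss_rpn_cls': ..., 'loss_mask': ...}
        losses = sum(loss for loss in loss_dict.values())

        # Stop early and report which term blew up instead of silently training on NaN.
        if not math.isfinite(losses.item()):
            per_term = {k: v.item() for k, v in loss_dict.items()}
            print(f"Non-finite loss at iteration {iteration}: {per_term}")
            raise FloatingPointError("Loss is NaN/Inf; try a lower LR or longer warmup.")

        optimizer.zero_grad()
        losses.backward()
        # Gradient clipping often prevents the divergence that precedes NaN losses.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
        optimizer.step()
```

If a particular loss term is the one diverging, lowering SOLVER.BASE_LR further or lengthening the warmup (SOLVER.WARMUP_ITERS in maskrcnn-benchmark-style configs) is usually the next thing to try, but treat those values as starting points rather than a fix.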

@ArghyaPal

My partner and I ran into the same problem. We tried training the R101 network after uncommenting the RPN, and it is working (we are currently past 31K iterations). We acknowledge that this differs from the training method in the CVPR VC R-CNN paper; our guess is that the backbone does not train well once the RPN is removed, but we may be wrong.

Requesting @Wangt-CN to comment on this.
