
nan loss while training #19

Open
Park-ing-lot opened this issue Aug 12, 2021 · 1 comment

Comments

@Park-ing-lot

I followed your instructions with the COCO 2017 train and val data, using this command:

CUDA_VISIBLE_DEVICES=2 python tools/train_net.py --config-file "configs/e2e_mask_rcnn_R_101_FPN_1x.yaml" --skip-test SOLVER.IMS_PER_BATCH 2 SOLVER.BASE_LR 0.0025 SOLVER.MAX_ITER 720000 SOLVER.STEPS "(480000, 640000)" TEST.IMS_PER_BATCH 1 MODEL.RPN.FPN_POST_NMS_TOP_N_TRAIN 2000

During training, the loss stays around 8 and does not drop; after about 6,000 steps, the model starts producing NaN losses.

Do you have any idea why the loss becomes NaN? What could the problem be?
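
For reference, a common first step when debugging this is to check each loss term for NaN/Inf before backpropagation and to clip gradients. Below is a minimal PyTorch sketch of that check against a generic training loop; `model`, `optimizer`, `data_loader`, and the loss-dict keys are placeholders, not the actual objects built by tools/train_net.py.

```python
# Minimal sketch, assuming a maskrcnn-benchmark-style model that returns a dict of losses.
# `model`, `optimizer`, and `data_loader` are hypothetical placeholders.
import math
import torch

def train_one_epoch(model, optimizer, data_loader, device, max_norm=5.0):
    model.train()
    for iteration, (images, targets) in enumerate(data_loader):
        images = images.to(device)
        targets = [t.to(device) for t in targets]

        loss_dict = model(images, targets)  # e.g. {'loss_rpn_cls': ..., 'loss_mask': ...}
        losses = sum(loss for loss in loss_dict.values())

        # Stop early and report which term blew up instead of silently training on NaN.
        if not math.isfinite(losses.item()):
            per_term = {k: v.item() for k, v in loss_dict.items()}
            print(f"Non-finite loss at iteration {iteration}: {per_term}")
            raise FloatingPointError("Loss is NaN/Inf; try a lower LR or longer warmup.")

        optimizer.zero_grad()
        losses.backward()
        # Gradient clipping often prevents the divergence that precedes NaN losses.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
        optimizer.step()
```

If a particular loss term is the one diverging, lowering SOLVER.BASE_LR further or lengthening the warmup (SOLVER.WARMUP_ITERS in maskrcnn-benchmark-style configs) is usually the next thing to try, but treat those values as starting points rather than a fix.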

@ArghyaPal

My partner and I ran into the same problem. We tried training the R101 network after uncommenting the RPN, and it is working (we are currently past 31K iterations). We acknowledge that this differs from the training method in the CVPR VC R-CNN paper; our guess is that the backbone does not train well once the RPN is removed, but we may be wrong.

Requesting @Wangt-CN to comment on this.
