
Fail to reproduce the result #3

Closed
cocoshe opened this issue Sep 30, 2024 · 13 comments

@cocoshe

cocoshe commented Sep 30, 2024

Thanks for your great work. I fine-tuned the model with the command in the README:

python -m torch.distributed.launch --nproc_per_node 1 --master_port 10010 --use_env train.py --freeze_text_encoder --with_box_refine --binary --dataset_file mevis --epochs 2 --lr_drop 1 --resume [MUTR checkpoint] --output_dir [output path] --mevis_path [MeViS path] --backbone swin_l_p4w7

specifically:

CUDA_VISIBLE_DEVICES=7,6,5,4 python -m torch.distributed.launch --nproc_per_node 4 --master_port 10010 --use_env train.py --freeze_text_encoder --with_box_refine --binary --dataset_file mevis --epochs 2 --lr_drop 1 --resume "ckpt/swin_l_p4w7.pth" --output_dir "./MUTR_ft_output" --mevis_path "datasets/mevis" --backbone swin_l_p4w7

Then I ran inference on valid (online) and valid_u (offline).

I used the eval script eval_mevis.py to calculate the offline metrics.

Here are the results:

1. Use the ckpt in your repo:
   J: 0.5256618287625423
   F: 0.6128153525644362
   J&F: 0.5692385906634893
   time: 129.4594 s

2. Train the model by myself with the cmd above:
   J: 0.48942632887183013
   F: 0.5868775221459913
   J&F: 0.5381519255089107
   time: 163.5233 s
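As a quick sanity check (assuming eval_mevis.py follows the standard DAVIS-style convention where the reported J&F is the arithmetic mean of J and F), the numbers above are self-consistent, and the gap between the two runs is about 0.031 J&F:

```python
# Sanity check of the printed metrics, assuming J&F = (J + F) / 2 as in the
# standard DAVIS evaluation (eval_mevis.py is expected to follow this convention).
repo_ckpt = {"J": 0.5256618287625423, "F": 0.6128153525644362}
my_ckpt = {"J": 0.48942632887183013, "F": 0.5868775221459913}

jf = {name: (m["J"] + m["F"]) / 2 for name, m in [("repo", repo_ckpt), ("mine", my_ckpt)]}
print(jf["repo"])               # 0.5692..., matches the printed J&F
print(jf["mine"])               # 0.5381..., matches the printed J&F
print(jf["repo"] - jf["mine"])  # ~0.031 J&F gap between the two checkpoints
```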

BTW, the inference settings are the same for both.

Any idea on how to reproduce the ckpt result?

@gaomingqi
Contributor

gaomingqi commented Oct 2, 2024

Hello, thanks for your message. Could you give the comparison between the reported and reproduced results on the valid (online) set? The training script seems ok. Can you also confirm that the ckpt used for evaluation is trained with the last epoch? Thanks! :D

@cocoshe
Author

cocoshe commented Oct 2, 2024

> Hello, thanks for your message. Could you also provide the comparison between the provided and reproduced results on the valid (online) set? The training script seems ok. Can you also confirm that the ckpt used for evaluation is trained with the last epoch? Thanks! :D

Thanks for your reply. I submitted the results, inferred on the valid dataset, to the CodaLab evaluator here:

[screenshots: CodaLab evaluation results]

I used the command in the README for inference:

python inference_mevis.py --with_box_refine --binary --freeze_text_encoder --output_dir [output path] --resume [checkpoint path] --ngpu 1 --batch_size 1 --backbone swin_l_p4w7 --mevis_path [MeViS path] --split valid --sub_video_len 30 --no_sampling (optional, no sampling mode)

specifically, here is the command from my shell history:

 1964  2024-10-01 13:01:18 CUDA_VISIBLE_DEVICES=7,6,5,4 python inference_mevis.py --with_box_refine --binary --freeze_text_encoder --output_dir output_no_freeze_backbone --resume MUTR_ft_output/checkpoint0001.pth --ngpu 4 --batch_size 1 --backbone swin_l_p4w7 --mevis_path datasets/mevis --split valid --sub_video_len 30 --no_sampling

The 0.473 is the online score of your ckpt, and 0.4486 is the online score of my ckpt with the backbone frozen (no need to pay attention to this one). The 0.4591 is the online score of my ckpt trained with the command I mentioned before:

CUDA_VISIBLE_DEVICES=7,6,5,4 python -m torch.distributed.launch --nproc_per_node 4 --master_port 10010 --use_env train.py --freeze_text_encoder --with_box_refine --binary --dataset_file mevis --epochs 2 --lr_drop 1 --resume "ckpt/swin_l_p4w7.pth" --output_dir "./MUTR_ft_output" --mevis_path "datasets/mevis" --backbone swin_l_p4w7

> Can you also confirm that the ckpt used for evaluation is trained with the last epoch?

About this: I checked my terminal history, and the ckpt used for evaluation should be checkpoint0001.pth.

Any idea on what's wrong?

@gaomingqi
Contributor

I didn't see anything wrong and will train with the same script to reproduce the issue. BTW, may I ask if the visual backbone was frozen when training the model? Thanks:D

@cocoshe
Author

cocoshe commented Oct 3, 2024

> I didn't see anything wrong and will train with the same script to reproduce the issue. BTW, may I ask if the visual backbone was frozen when training the model? Thanks:D

The backbone is not frozen for the “output_no_freeze” zip. The “output_freeze” zip is just my own setting to check how the score is influenced when the backbone is frozen, a quick experiment since it takes about 8 hours less than the “no_freeze” one, so no need to care about the frozen result.

Only the “no freeze” one needs attention when reproducing the reported result.

@cocoshe
Author

cocoshe commented Oct 9, 2024

> I didn't see anything wrong and will train with the same script to reproduce the issue. BTW, may I ask if the visual backbone was frozen when training the model? Thanks:D

Hello! Any idea about what's going wrong? Or could you provide the exact training command you used? The --nproc_per_node 1 in your training seems a bit odd to me; I would expect fine-tuning to use more than one device.

@gaomingqi
Contributor

I got the same incorrect results (0.45) with the provided script, and I conjecture the reason lies in the training epochs and lr_drops (it might be epochs=3 and lr_drop=2, but the provided ones are 2 and 1). I am training with this setting and will let you know this weekend.

As for nproc_per_node=1, this is for single-gpu training and should be the same as the number of gpus in your experiment. Thanks! :D
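For reference, a minimal sketch of what the launcher does per process (illustrative only; MUTR's train.py has its own distributed setup and may differ in details):

```python
# torch.distributed.launch / torchrun spawns nproc_per_node processes and sets
# LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT for each of them
# (--use_env puts LOCAL_RANK in the environment rather than passing --local_rank).
import os
import torch
import torch.distributed as dist

def init_distributed():
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)            # bind this process to one GPU
    dist.init_process_group(backend="nccl")      # reads the env:// settings above
    return local_rank, dist.get_world_size()

# With CUDA_VISIBLE_DEVICES=7,6,5,4 and --nproc_per_node 4, four processes start,
# one per visible GPU, so nproc_per_node should equal the number of GPUs you expose.
```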

@cocoshe
Author

cocoshe commented Oct 9, 2024

> I got the same incorrect results (0.45) with the provided script, and I conjecture the reason lies in the training epochs and lr_drops (it might be epochs=3 and lr_drop=2, but the provided ones are 2 and 1). I am training with this setting and will let you know this weekend.
>
> As for nproc_per_node=1, this is for single-gpu training and should be the same as the number of gpus in your experiment. Thanks! :D

Thanks for your reply~
I'd also like to ask about the training time for your single-GPU setting. I trained with 4 GPUs (--nproc_per_node 4) instead of 1 GPU and got an incorrect result (0.4591). So we used 1 and 4 GPUs for training and got the same (incorrect) results; does that mean the number of training GPUs does not affect the results and only affects the training time?

@cocoshe
Author

cocoshe commented Oct 12, 2024

> I got the same incorrect results (0.45) with the provided script, and I conjecture the reason lies in the training epochs and lr_drops (it might be epochs=3 and lr_drop=2, but the provided ones are 2 and 1). I am training with this setting and will let you know this weekend.
>
> As for nproc_per_node=1, this is for single-gpu training and should be the same as the number of gpus in your experiment. Thanks! :D

Hi~ Any progress? BTW, I'm curious about the fine-tuning time; it could take many hours or even days with only one GPU, right?

@gaomingqi
Contributor

> > I got the same incorrect results (0.45) with the provided script, and I conjecture the reason lies in the training epochs and lr_drops (it might be epochs=3 and lr_drop=2, but the provided ones are 2 and 1). I am training with this setting and will let you know this weekend.
> > As for nproc_per_node=1, this is for single-gpu training and should be the same as the number of gpus in your experiment. Thanks! :D
>
> Hi~ Any progress? BTW, I'm curious about the fine-tuning time; it could take many hours or even days with only one GPU, right?

Hello, sorry for the delay. The training takes several days to complete.

Yes, increasing nproc_per_node makes training faster and has little effect on the performance.

The inconsistency between reported and reproduced performance lies in the parameter --freeze_text_encoder 🥲. It should be REMOVED from the training script. I have updated the code to correct this.

Thank you so much for bringing this issue to my attention and for your patience! 🙌
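For anyone following along, here is a rough sketch of what a --freeze_text_encoder flag typically does (hypothetical attribute name; MUTR's actual code may differ):

```python
# With the flag set, the text encoder's parameters get requires_grad=False and
# receive no updates during fine-tuning; removing the flag lets the text encoder
# be fine-tuned on MeViS together with the rest of the model.
def maybe_freeze_text_encoder(model, freeze_text_encoder: bool) -> None:
    if freeze_text_encoder:
        for p in model.text_encoder.parameters():  # hypothetical attribute name
            p.requires_grad_(False)
```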

@cocoshe
Author

cocoshe commented Oct 15, 2024

> > > I got the same incorrect results (0.45) with the provided script, and I conjecture the reason lies in the training epochs and lr_drops (it might be epochs=3 and lr_drop=2, but the provided ones are 2 and 1). I am training with this setting and will let you know this weekend.
> > > As for nproc_per_node=1, this is for single-gpu training and should be the same as the number of gpus in your experiment. Thanks! :D
> >
> > Hi~ Any progress? BTW, I'm curious about the fine-tuning time; it could take many hours or even days with only one GPU, right?
>
> Hello, sorry for the delay. The training takes several days to complete.
>
> Yes, increasing nproc_per_node makes training faster and has little effect on the performance.
>
> The inconsistency between reported and reproduced performance lies in the parameter --freeze_text_encoder 🥲. It should be REMOVED from the training script. I have updated the code to correct this.
>
> Thank you so much for bringing this issue to my attention and for your patience! 🙌

Thanks for your patience! I will try it these days~

@cocoshe
Author

cocoshe commented Oct 18, 2024

Hi, I just tried to reproduce it with the new script, simply removing --freeze_text_encoder and training with 4 devices.
There may still be some differences during inference, so here are the results with --no_sampling (following your README) and without --no_sampling:

[screenshots: CodaLab evaluation results with and without --no_sampling]

There's still some margin between my results and the reported ones.

However, the score is 0.47446 now, about +0.02 over the online score I got without training the text encoder!

I don't know whether it is the number of devices or the inference command that leads to the margin.

@gaomingqi
Contributor

Hello, thanks for your feedback.

The number of devices indeed leads to the margin, since I got 0.4806801794 (w/ --no_sampling, sub_video_len=30) and 0.4877255771 (w/o --no_sampling, sub_video_len=30) in the last reproduction (with 8 GPUs).

[screenshot: reproduced CodaLab scores]

I am not sure how many devices were used in the challenge (it may have been 7), but I will remind users about this in the README.

I think more training devices (each device processes one training video, and the gradients on all devices are aggregated for the model update) effectively mean a larger training batch per update, which brings better generalisation.

Thanks! :D
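Roughly (taking "each device processes one training video" above at face value), the effective batch per model update scales with the number of GPUs, since DistributedDataParallel averages the per-process gradients each step:

```python
# Effective batch per update = videos per GPU per step * number of GPUs.
# DDP averages the per-process gradients, so each update "sees" all of them.
videos_per_gpu = 1  # one training video per device, as described above
for num_gpus in (1, 4, 8):
    print(f"{num_gpus} GPU(s) -> {videos_per_gpu * num_gpus} video(s) per update")
```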

@cocoshe
Author

cocoshe commented Oct 18, 2024

> Hello, thanks for your feedback.
>
> The number of devices indeed leads to the margin, since I got 0.4806801794 (w/ --no_sampling, sub_video_len=30) and 0.4877255771 (w/o --no_sampling, sub_video_len=30) in the last reproduction (with 8 GPUs).
>
> I am not sure how many devices were used in the challenge (it may have been 7), but I will remind users about this in the README.
>
> I think more training devices (each device processes one training video, and the gradients on all devices are aggregated for the model update) effectively mean a larger training batch per update, which brings better generalisation.
>
> Thanks! :D

OK, thanks for your timely reply and sincere assistance, that really helps a lot!

@cocoshe cocoshe closed this as completed Oct 18, 2024