
Fail to reproduce the result #3

Closed
cocoshe opened this issue Sep 30, 2024 · 13 comments

@cocoshe

cocoshe commented Sep 30, 2024

Thanks for your great work. I fine-tuned the model with the command in the README:

python -m torch.distributed.launch --nproc_per_node 1 --master_port 10010 --use_env train.py --freeze_text_encoder --with_box_refine --binary --dataset_file mevis --epochs 2 --lr_drop 1 --resume [MUTR checkpoint] --output_dir [output path] --mevis_path [MeViS path] --backbone swin_l_p4w7

specifically:

CUDA_VISIBLE_DEVICES=7,6,5,4 python -m torch.distributed.launch --nproc_per_node 4 --master_port 10010 --use_env train.py --freeze_text_encoder --with_box_refine --binary --dataset_file mevis --epochs 2 --lr_drop 1 --resume "ckpt/swin_l_p4w7.pth" --output_dir "./MUTR_ft_output" --mevis_path "datasets/mevis" --backbone swin_l_p4w7

Then I ran inference on valid (online) and valid_u (offline).

I used the eval script eval_mevis.py to calculate the offline metrics.

Here are the results:

1. Use the ckpt in your repo:
   J: 0.5256618287625423
   F: 0.6128153525644362
   J&F: 0.5692385906634893
   time: 129.4594 s

2. Train the model by myself with the cmd above:
   J: 0.48942632887183013
   F: 0.5868775221459913
   J&F: 0.5381519255089107
   time: 163.5233 s
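As a quick sanity check (assuming eval_mevis.py follows the standard DAVIS-style convention where the reported J&F is the arithmetic mean of J and F), the numbers above are self-consistent, and the gap between the two runs is about 0.031 J&F:

```python
# Sanity check of the printed metrics, assuming J&F = (J + F) / 2 as in the
# standard DAVIS evaluation (eval_mevis.py is expected to follow this convention).
repo_ckpt = {"J": 0.5256618287625423, "F": 0.6128153525644362}
my_ckpt = {"J": 0.48942632887183013, "F": 0.5868775221459913}

jf = {name: (m["J"] + m["F"]) / 2 for name, m in [("repo", repo_ckpt), ("mine", my_ckpt)]}
print(jf["repo"])               # 0.5692..., matches the printed J&F
print(jf["mine"])               # 0.5381..., matches the printed J&F
print(jf["repo"] - jf["mine"])  # ~0.031 J&F gap between the two checkpoints
```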

BTW, the inference settings are the same for both.

Any idea on how to reproduce the ckpt result?

@gaomingqi
Contributor

gaomingqi commented Oct 2, 2024

Hello, thanks for your message. Could you give the comparison between the reported and reproduced results on the valid (online) set? The training script seems ok. Can you also confirm that the ckpt used for evaluation is trained with the last epoch? Thanks! :D

@cocoshe
Author

cocoshe commented Oct 2, 2024

> Hello, thanks for your message. Could you also provide the comparison between the provided and reproduced results on the valid (online) set? The training script seems ok. Can you also confirm that the ckpt used for evaluation is trained with the last epoch? Thanks! :D

Thanks for your reply. I submitted the results, inferred on the valid dataset, to the CodaLab evaluator here:

[screenshots: CodaLab evaluation results]

I used the command in the README for inference:

python inference_mevis.py --with_box_refine --binary --freeze_text_encoder --output_dir [output path] --resume [checkpoint path] --ngpu 1 --batch_size 1 --backbone swin_l_p4w7 --mevis_path [MeViS path] --split valid --sub_video_len 30 --no_sampling (optional, no sampling mode)

specifically, here is the command from my shell history:

 1964  2024-10-01 13:01:18 CUDA_VISIBLE_DEVICES=7,6,5,4 python inference_mevis.py --with_box_refine --binary --freeze_text_encoder --output_dir output_no_freeze_backbone --resume MUTR_ft_output/checkpoint0001.pth --ngpu 4 --batch_size 1 --backbone swin_l_p4w7 --mevis_path datasets/mevis --split valid --sub_video_len 30 --no_sampling

The 0.473 is the online score of your ckpt, and 0.4486 is the online score of my ckpt with the backbone frozen (no need to pay attention to this one). The 0.4591 is the online score of my ckpt trained with the command I mentioned before:

CUDA_VISIBLE_DEVICES=7,6,5,4 python -m torch.distributed.launch --nproc_per_node 4 --master_port 10010 --use_env train.py --freeze_text_encoder --with_box_refine --binary --dataset_file mevis --epochs 2 --lr_drop 1 --resume "ckpt/swin_l_p4w7.pth" --output_dir "./MUTR_ft_output" --mevis_path "datasets/mevis" --backbone swin_l_p4w7

> Can you also confirm that the ckpt used for evaluation is trained with the last epoch?

About this: I checked my terminal history, and the ckpt used for evaluation should be checkpoint0001.pth.

Any idea on what's wrong?

@gaomingqi
Contributor

I didn't see anything wrong and will train with the same script to reproduce the issue. BTW, may I ask if the visual backbone was frozen when training the model? Thanks:D

@cocoshe
Author

cocoshe commented Oct 3, 2024

> I didn't see anything wrong and will train with the same script to reproduce the issue. BTW, may I ask if the visual backbone was frozen when training the model? Thanks:D

The backbone is not frozen for the “output_no_freeze” zip. The “output_freeze” zip is just my own setting to check how the score is influenced when the backbone is frozen, a quick experiment since it takes about 8 hours less than the “no_freeze” one, so no need to care about the frozen result.

Only the “no freeze” one needs attention when reproducing the reported result.

@cocoshe
Author

cocoshe commented Oct 9, 2024

> I didn't see anything wrong and will train with the same script to reproduce the issue. BTW, may I ask if the visual backbone was frozen when training the model? Thanks:D

Hello! Any idea about what's going wrong? Or could you provide the exact training command you used? The --nproc_per_node 1 in your training seems a bit odd to me; I would expect fine-tuning to use more than one device.

@gaomingqi
Contributor

I got the same incorrect results (0.45) with the provided script, and I conjecture the reason lies in the training epochs and lr_drops (it might be epochs=3 and lr_drop=2, but the provided ones are 2 and 1). I am training with this setting and will let you know this weekend.

As for nproc_per_node=1, this is for single-gpu training and should be the same as the number of gpus in your experiment. Thanks! :D
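For reference, a minimal sketch of what the launcher does per process (illustrative only; MUTR's train.py has its own distributed setup and may differ in details):

```python
# torch.distributed.launch / torchrun spawns nproc_per_node processes and sets
# LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT for each of them
# (--use_env puts LOCAL_RANK in the environment rather than passing --local_rank).
import os
import torch
import torch.distributed as dist

def init_distributed():
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)            # bind this process to one GPU
    dist.init_process_group(backend="nccl")      # reads the env:// settings above
    return local_rank, dist.get_world_size()

# With CUDA_VISIBLE_DEVICES=7,6,5,4 and --nproc_per_node 4, four processes start,
# one per visible GPU, so nproc_per_node should equal the number of GPUs you expose.
```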

@cocoshe
Author

cocoshe commented Oct 9, 2024

> I got the same incorrect results (0.45) with the provided script, and I conjecture the reason lies in the training epochs and lr_drops (it might be epochs=3 and lr_drop=2, but the provided ones are 2 and 1). I am training with this setting and will let you know this weekend.
>
> As for nproc_per_node=1, this is for single-gpu training and should be the same as the number of gpus in your experiment. Thanks! :D

Thanks for your reply~
I'd also like to ask about the training time for your single-GPU setting. I trained with 4 GPUs (--nproc_per_node 4) instead of 1 GPU and got an incorrect result (0.4591). So we used 1 and 4 GPUs for training and got the same (incorrect) results; does that mean the number of training GPUs does not affect the results and only affects the training time?

@cocoshe
Author

cocoshe commented Oct 12, 2024

> I got the same incorrect results (0.45) with the provided script, and I conjecture the reason lies in the training epochs and lr_drops (it might be epochs=3 and lr_drop=2, but the provided ones are 2 and 1). I am training with this setting and will let you know this weekend.
>
> As for nproc_per_node=1, this is for single-gpu training and should be the same as the number of gpus in your experiment. Thanks! :D

Hi~ Any progress? BTW, I'm curious about the fine-tuning time; it could take many hours or even days with only one GPU, right?

@gaomingqi
Contributor

> > I got the same incorrect results (0.45) with the provided script, and I conjecture the reason lies in the training epochs and lr_drops (it might be epochs=3 and lr_drop=2, but the provided ones are 2 and 1). I am training with this setting and will let you know this weekend.
> > As for nproc_per_node=1, this is for single-gpu training and should be the same as the number of gpus in your experiment. Thanks! :D
>
> Hi~ Any progress? BTW, I'm curious about the fine-tuning time; it could take many hours or even days with only one GPU, right?

Hello, sorry for the delay. The training takes several days to complete.

Yes, increasing nproc_per_node makes training faster and has little effect on the performance.

The inconsistency between reported and reproduced performance lies in the parameter --freeze_text_encoder 🥲. It should be REMOVED from the training script. I have updated the code to correct this.

Thank you so much for bringing this issue to my attention and for your patience! 🙌
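For anyone following along, here is a rough sketch of what a --freeze_text_encoder flag typically does (hypothetical attribute name; MUTR's actual code may differ):

```python
# With the flag set, the text encoder's parameters get requires_grad=False and
# receive no updates during fine-tuning; removing the flag lets the text encoder
# be fine-tuned on MeViS together with the rest of the model.
def maybe_freeze_text_encoder(model, freeze_text_encoder: bool) -> None:
    if freeze_text_encoder:
        for p in model.text_encoder.parameters():  # hypothetical attribute name
            p.requires_grad_(False)
```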

@cocoshe
Author

cocoshe commented Oct 15, 2024

> > > I got the same incorrect results (0.45) with the provided script, and I conjecture the reason lies in the training epochs and lr_drops (it might be epochs=3 and lr_drop=2, but the provided ones are 2 and 1). I am training with this setting and will let you know this weekend.
> > > As for nproc_per_node=1, this is for single-gpu training and should be the same as the number of gpus in your experiment. Thanks! :D
> >
> > Hi~ Any progress? BTW, I'm curious about the fine-tuning time; it could take many hours or even days with only one GPU, right?
>
> Hello, sorry for the delay. The training takes several days to complete.
>
> Yes, increasing nproc_per_node makes training faster and has little effect on the performance.
>
> The inconsistency between reported and reproduced performance lies in the parameter --freeze_text_encoder 🥲. It should be REMOVED from the training script. I have updated the code to correct this.
>
> Thank you so much for bringing this issue to my attention and for your patience! 🙌

Thanks for your patience! I will try it these days~

@cocoshe
Author

cocoshe commented Oct 18, 2024

Hi, I just tried to reproduce it with the new script, simply removing --freeze_text_encoder and training with 4 devices.
There may still be some differences during inference, so here are the results with --no_sampling (following your README) and without --no_sampling:

[screenshots: CodaLab evaluation results with and without --no_sampling]

There's still some margin between my results and the reported ones.

However, the score is 0.47446 now, about +0.02 over the online score I got without training the text encoder!

I don't know whether it is the number of devices or the inference command that leads to the margin.

@gaomingqi
Contributor

Hello, thanks for your feedback.

The number of devices indeed leads to the margin, since I got 0.4806801794 (w/ --no_sampling, sub_video_len=30) and 0.4877255771 (w/o --no_sampling, sub_video_len=30) in the last reproduction (with 8 GPUs).

[screenshot: reproduced CodaLab scores]

I am not sure how many devices were used in the challenge (it may have been 7), but I will remind users about this in the README.

I think more training devices (each device processes one training video, and the gradients on all devices are aggregated for the model update) effectively mean a larger training batch per update, which brings better generalisation.

Thanks! :D
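Roughly (taking "each device processes one training video" above at face value), the effective batch per model update scales with the number of GPUs, since DistributedDataParallel averages the per-process gradients each step:

```python
# Effective batch per update = videos per GPU per step * number of GPUs.
# DDP averages the per-process gradients, so each update "sees" all of them.
videos_per_gpu = 1  # one training video per device, as described above
for num_gpus in (1, 4, 8):
    print(f"{num_gpus} GPU(s) -> {videos_per_gpu * num_gpus} video(s) per update")
```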

@cocoshe
Author

cocoshe commented Oct 18, 2024

> Hello, thanks for your feedback.
>
> The number of devices indeed leads to the margin, since I got 0.4806801794 (w/ --no_sampling, sub_video_len=30) and 0.4877255771 (w/o --no_sampling, sub_video_len=30) in the last reproduction (with 8 GPUs).
>
> I am not sure how many devices were used in the challenge (it may have been 7), but I will remind users about this in the README.
>
> I think more training devices (each device processes one training video, and the gradients on all devices are aggregated for the model update) effectively mean a larger training batch per update, which brings better generalisation.
>
> Thanks! :D

OK, thanks for your timely reply and sincere assistance, that really helps a lot!

@cocoshe cocoshe closed this as completed Oct 18, 2024