Incorrect FID-VID and FVD #25
Which checkpoint did you use to evaluate?
@Fanghaipeng Thanks!
exp_folder=$1
Hi, I have a similar problem.
FID/L1/SSIM/LPIPS/PSNR are similar, but FVD-3DRN50 (FID-VID) and FVD-3DInception (FVD) are different (DISCO † w/ HAP, CFG). Let me also attach some GIFs generated for FVD, to check whether the generated results are correct.
BTW, I found that generating the GIFs with imageio at 3 fps (gen_eval.sh) yields a tbr of 24.25, so when ffmpeg converts a GIF to video, the original 16-frame GIF turns into a 128-frame video. I tried changing 3 fps to 25 fps in gen_eval.sh, and the results became even weirder. P.S.: the results above were generated with PyTorch 2.0 (with some code changes for loading the checkpoint) and other newer packages. When I reproduced the exact pip package versions from the README, the '25 fps' FVD-3DRN50 was 96.15072454699538, and the other results were similar, as stated in #36.
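The frame-count blow-up described here is just the GIF's duration being resampled to ffmpeg's output timebase: 16 frames at 3 fps span about 5.33 s, and re-encoding at ~24 fps duplicates frames to fill that span. A minimal arithmetic sketch (illustrative only, not the actual gen_eval.sh pipeline; the reported tbr of 24.25 is approximated as 24 here):

```python
def frames_after_reencode(n_frames: int, gif_fps: float, video_fps: float) -> int:
    """Frame count after ffmpeg resamples a GIF to a new frame rate.

    ffmpeg keeps the clip's wall-clock duration and duplicates (or drops)
    frames so the output matches the target rate.
    """
    duration_s = n_frames / gif_fps        # 16 frames at 3 fps -> ~5.33 s
    return round(duration_s * video_fps)   # frames needed at the output rate

print(frames_after_reencode(16, 3, 24))    # 128, matching the blow-up above
```

This is why re-timing the GIF to match the evaluation frame rate (or feeding frames to the metric directly) avoids the duplicated-frame video.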
I'm hitting the same issue!
I also cannot reproduce the results using "gen_eval.sh" with the FID-VID: 18.86 model provided by the official implementation. With the default guidance scale of 3.0, my result is {'FVD-3DRN50': 21.664065154647858, 'FVD-3DInception': 567.6111442260626}. With the optimal guidance scale of 1.5 reported in the paper, my result is {'FVD-3DRN50': 23.933738779128873, 'FVD-3DInception': 564.9114347158875}, compared to the paper's FID-VID 18.86 and FVD 393.34.
Dear all, sorry for the delay: I lost access to the computing resources for this project after my internship ended in July. A few days ago I got temporary access again and revisited this codebase. I used a completely new environment to make sure the codebase can be reproduced in most setups.
We use this resnet-kinetics checkpoint and this i3d checkpoint (under eval_fvd). I think the differing results may be due to different checkpoint models, which I forgot to sync from the corporate storage.
For both the baseline and our model, we get a better FVD but a higher FID-VID. (P.S.: we used the same generated frames for the previous metric in the paper and for this new metric.) We plan to use this new metric calculation to avoid confusion, and have updated the evaluation code in the latest commit (note: if you want to verify reproduction of the previous results, do not pull the latest commit; just download the FVD pretrained models). We will update the paper ASAP. If you hit any further reproduction problems, please comment here.
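For context, both FVD-3DRN50 (FID-VID) and FVD-3DInception (FVD) reduce to a Fréchet distance between Gaussian statistics of video features from the respective backbone (3D ResNet-50 or I3D), which is why swapping the backbone checkpoint changes the numbers. A minimal sketch of the distance itself, assuming NumPy/SciPy and precomputed feature means and covariances (backbone feature extraction is omitted):

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between two multivariate Gaussians N(mu1, sigma1), N(mu2, sigma2)."""
    diff = mu1 - mu2
    # Matrix square root of the covariance product; may pick up tiny imaginary parts.
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

In an FVD-style pipeline, `mu`/`sigma` come from backbone features of real vs. generated clips; identical statistics give a distance of 0, and with identity covariances the distance is just the squared mean gap.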
I cannot reproduce the results with the checkpoint (TikTok Training Data), using the updated evaluation code and the provided vision models for computing the FVD metrics.
Hi, let's try to find the issue. There are actually two steps to get the results: (a) generate the images; (b) compute the metrics.
Hi, sorry for the late reply; I am currently working on another project and not focusing on video generation.
@Wangt-CN |
When I used 4 NVIDIA A100s with batch_size=2 and nframe=16 and ran gen_eval_tm.sh, I obtained results similar to @asdasdad738:
Thanks for the great work. @Wangt-CN
I tried to reproduce the results using "gen_eval.sh", but the FID-VID and FVD do not match the numbers reported in the paper. Could you help me with this? Is it possible that I am using the wrong checkpoints?
Downloaded checkpoints:
pth: TikTok Training Data (FID-VID: 18.8)
FID-VID: resnet-50-kinetics.pth: "https://github.com/yjh0410/YOWOF/releases/download/yowof-weight/resnet-50-kinetics.pth"
FVD: i3d_pretrained_400.pt: "https://drive.google.com/file/d/1mQK8KD8G6UWRa5t87SRMm5PVXtlpneJT/edit"