- [2024.11.27] 🔥🔥🔥 We have published our report, which provides comprehensive training details and includes additional experiments.
- [2024.11.25] 🔥🔥🔥 We have released our 16-channel WF-VAE-L model along with the training code. It is available for download on Hugging Face.
WF-VAE utilizes a multi-level wavelet transform to construct an efficient energy pathway, enabling low-frequency information from video data to flow into the latent representation. This design achieves competitive reconstruction performance while markedly reducing computational cost.
- This architecture substantially improves speed and reduces training costs in large-scale video generation models and data processing workflows.
- Our experiments demonstrate that our model performs competitively with state-of-the-art VAEs.
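The low-frequency energy pathway can be illustrated with a single-level 3D Haar transform. The sketch below (plain NumPy, not the actual WF-VAE implementation, which uses a multi-level transform inside the encoder) extracts the low-frequency subband of a video tensor by averaging over 2×2×2 blocks along time, height, and width:

```python
import numpy as np

def haar_lowpass_3d(video: np.ndarray) -> np.ndarray:
    """Single-level 3D Haar low-pass: average each 2x2x2 (t, h, w) block.

    video: array of shape (T, H, W) with even T, H, W.
    Returns the low-frequency (LLL) subband of shape (T//2, H//2, W//2).
    Illustration only -- WF-VAE applies a multi-level wavelet transform.
    """
    t, h, w = video.shape
    blocks = video.reshape(t // 2, 2, h // 2, 2, w // 2, 2)
    return blocks.mean(axis=(1, 3, 5))

video = np.random.rand(8, 64, 64).astype(np.float32)
low = haar_lowpass_3d(video)
print(low.shape)  # (4, 32, 32)
```

Stacking this operation recursively on the low-frequency output yields the multi-level decomposition that feeds low-frequency video content directly toward the latent.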
*Reconstruction comparison between WF-VAE and CogVideoX (see the report for full results).*
We conducted efficiency tests on 33-frame videos at float32 precision on an H100 GPU; all models were run without block-wise inference strategies. Our model matches state-of-the-art VAEs in reconstruction quality while significantly reducing encoding cost.
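A wall-clock benchmark like the one above follows a standard pattern: warm up once, then average several timed runs. The sketch below is a generic CPU harness with a hypothetical `encode` stand-in, not the script behind the reported numbers; on GPU you would also call `torch.cuda.synchronize()` before reading the clock so asynchronous kernels are included in the measurement.

```python
import time
import numpy as np

def encode(video: np.ndarray) -> np.ndarray:
    # Hypothetical stand-in for a VAE encoder forward pass:
    # spatially subsample 8x and pool over time.
    return video[:, ::8, ::8].mean(axis=0, keepdims=True)

video = np.random.rand(33, 256, 256).astype(np.float32)  # 33-frame clip

encode(video)  # warm-up run, excluded from timing
times = []
for _ in range(5):
    t0 = time.perf_counter()
    encode(video)
    times.append(time.perf_counter() - t0)
print(f"mean encode time: {np.mean(times) * 1e3:.2f} ms")
```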
```bash
git clone https://github.com/PKU-YuanGroup/WF-VAE
cd WF-VAE
conda create -n wfvae python=3.10 -y
conda activate wfvae
pip install -r requirements.txt
```
To reconstruct a video or an image, execute the following commands:
```bash
CUDA_VISIBLE_DEVICES=1 python scripts/recon_single_video.py \
    --model_name WFVAE \
    --from_pretrained "Your VAE" \
    --video_path "Video Path" \
    --rec_path rec.mp4 \
    --device cuda \
    --sample_rate 1 \
    --num_frames 65 \
    --height 512 \
    --width 512 \
    --fps 30 \
    --enable_tiling
```

```bash
CUDA_VISIBLE_DEVICES=1 python scripts/recon_single_image.py \
    --model_name WFVAE \
    --from_pretrained "Your VAE" \
    --image_path assets/gt_5544.jpg \
    --rec_path rec.jpg \
    --device cuda \
    --short_size 512
```
For further guidance, refer to the example scripts `examples/rec_single_video.sh` and `examples/rec_single_image.sh`.
Instructions for training and validation are provided in TRAIN_AND_VALIDATE.md.
- Open-Sora Plan - https://github.com/PKU-YuanGroup/Open-Sora-Plan
- Allegro - https://github.com/rhymes-ai/Allegro
- CogVideoX - https://github.com/THUDM/CogVideo
- Stable Diffusion - https://github.com/CompVis/stable-diffusion
If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝.
```bibtex
@misc{li2024wfvaeenhancingvideovae,
    title={WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model},
    author={Zongjian Li and Bin Lin and Yang Ye and Liuhan Chen and Xinhua Cheng and Shenghai Yuan and Li Yuan},
    year={2024},
    eprint={2411.17459},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2411.17459},
}
```
This project is released under the Apache 2.0 license as found in the LICENSE file.