Releases: PKU-YuanGroup/Open-Sora-Plan

Release v1.3.1

22 Oct 11:20
  1. Fixed a bug caused by missing VAE code, #475
  2. Fixed the explicit_uniform_sampling function proposed by CogVideoX, 390de3f (a sketch of the idea follows this list)
  3. Released the prompt refiner code and data; more details can be found here.
  4. Updated the prompt-refiner.00003 weight and the text-to-video weights.
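
A minimal sketch of the explicit uniform sampling idea from CogVideoX (illustrative only, not the repository's exact function): the timestep range is split into one slice per data-parallel rank, and each rank samples only from its own slice, so every optimization step covers the range more evenly than i.i.d. sampling.

```python
import torch

def explicit_uniform_sampling(T: int, n_ranks: int, rank: int,
                              batch_size: int) -> torch.Tensor:
    # Split [0, T) into n_ranks contiguous intervals; rank i draws its
    # diffusion timesteps uniformly from the i-th interval only.
    interval = T / n_ranks
    lo = int(rank * interval)        # inclusive lower bound of this rank's slice
    hi = int((rank + 1) * interval)  # exclusive upper bound
    return torch.randint(lo, hi, (batch_size,))

# e.g. with 1000 timesteps and 4 ranks, rank 2 samples from [500, 750)
t = explicit_uniform_sampling(T=1000, n_ranks=4, rank=2, batch_size=8)
```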

Release v1.3.0

15 Oct 18:03

In version 1.3.0, Open-Sora-Plan introduced the following five key features:

  1. A more powerful and cost-efficient WFVAE. We decompose video into several sub-bands using wavelet transforms, naturally capturing information across different frequency domains and leading to more efficient and robust VAE learning (a sketch of the sub-band idea follows this list).
  2. Prompt Refiner. A large language model designed to refine short text inputs.
  3. High-quality data cleaning strategy. The cleaned Panda70M dataset retains only 27% of the original data.
  4. DiT with new sparse attention. A more cost-effective and efficient learning approach.
  5. Dynamic resolution and dynamic duration. This enables more efficient utilization of videos with varying lengths, treating a single frame as an image (see the bucketing sketch below).
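
As a minimal, hedged sketch of the sub-band idea (illustrative only; the actual WFVAE also decomposes along the temporal axis and over multiple levels), a single-level 2D Haar transform splits each frame into one low-frequency and three high-frequency sub-bands:

```python
import torch

def haar_decompose_2d(x: torch.Tensor):
    # x: frames of shape (B, C, H, W) with even H and W.
    a = x[..., 0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[..., 0::2, 1::2]  # top-right
    c = x[..., 1::2, 0::2]  # bottom-left
    d = x[..., 1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2  # low frequency: coarse structure (local averages)
    lh = (a - b + c - d) / 2  # high frequency: left-right differences
    hl = (a + b - c - d) / 2  # high frequency: top-bottom differences
    hh = (a - b - c + d) / 2  # high frequency: diagonal differences
    return ll, lh, hl, hh

frames = torch.randn(1, 3, 256, 256)
ll, lh, hl, hh = haar_decompose_2d(frames)  # each sub-band is 128x128
```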

For further details, please refer to our report.
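
For dynamic resolution and dynamic duration, a common implementation strategy (shown here as an illustrative sketch under assumed field names, not the repository's exact sampler) is to bucket samples by shape so that clips of different lengths and aspect ratios can share one training run, with a single frame treated as an image bucket:

```python
import random
from collections import defaultdict

def bucket_by_shape(samples, batch_size):
    # Group samples that share the same (num_frames, height, width) shape,
    # then form batches within each bucket so tensors stack without padding.
    buckets = defaultdict(list)
    for s in samples:
        buckets[(s["num_frames"], s["height"], s["width"])].append(s)
    batches = []
    for group in buckets.values():
        random.shuffle(group)
        for i in range(0, len(group), batch_size):
            batches.append(group[i:i + batch_size])
    random.shuffle(batches)  # interleave buckets across the epoch
    return batches

data = [
    {"num_frames": 93, "height": 352, "width": 640},
    {"num_frames": 93, "height": 352, "width": 640},
    {"num_frames": 1,  "height": 512, "width": 512},  # image: one frame
]
for batch in bucket_by_shape(data, batch_size=2):
    print(len(batch), batch[0]["num_frames"])
```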

  • COMING SOON ⚡️⚡️⚡️ For large-model parallel training, TP, SP, and more strategies are coming...

    A Huawei Ascend MindSpeed-MM branch will be added soon. Backed by Huawei's MindSpeed-MM suite, it will support scaling up Open-Sora Plan's parameter count, providing TP, SP, and other distributed training capabilities for training models at larger parameter scales.

Release v1.2.0

25 Jul 06:28
adb2a20

v1.2.0 is here! It replaces the 2+1D design with a 3D full attention architecture: a true 3D video diffusion model trained on 4-second 720p video.

  • Shifted the architecture from the 2+1D model to 3D full attention; 2+1D is no longer supported.
  • Instead of joint image-video training, the image weights are trained first and used as the initialization for the video model.
  • Released all data annotations; the data are filtered by aesthetics and motion.
  • Improved CausalVideoVAE performance and reported results on the validation sets of WebVid and Panda70M.

Although the 3D attention architecture excels at spatio-temporal consistency, it is so expensive to train that it is difficult to scale up; the sketch below makes the cost gap concrete. We hope to collaborate with the open-source community to optimize the 3D DiT architecture. For further details, please refer to our report.
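
A minimal sketch of where the cost comes from (illustrative shapes; self-attention with q = k = v for brevity): with T frames of S tokens each, factorized 2+1D attention costs roughly O(T·S² + S·T²), while 3D full attention over all space-time tokens costs O((T·S)²).

```python
import torch
import torch.nn.functional as F

B, T, S, D = 1, 16, 1024, 64   # batch, frames, tokens per frame, head dim
x = torch.randn(B, T, S, D)

# 2+1D factorized attention: spatial within each frame, then temporal
# across frames at each spatial location.
xs = x.reshape(B * T, S, D)
xs = F.scaled_dot_product_attention(xs, xs, xs)            # spatial pass
xt = xs.reshape(B, T, S, D).transpose(1, 2).reshape(B * S, T, D)
xt = F.scaled_dot_product_attention(xt, xt, xt)            # temporal pass
out_2p1d = xt.reshape(B, S, T, D).transpose(1, 2)

# 3D full attention: every space-time token attends to every other one.
# With these shapes the score matrix has (T*S)^2 entries, i.e. T times
# more than all T per-frame S^2 score matrices combined.
x3 = x.reshape(B, T * S, D)
out_3d = F.scaled_dot_product_attention(x3, x3, x3)
```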

Release v1.1.0

27 May 10:02
2a8b232
  • Support for longer videos, dynamic resolution training, and inference.
  • Support for Ascend training and inference.
  • Released all training data and annotations.
  • Improved CausalVideoVAE performance.

In this version, we employ ShareGPT4Video for video annotation and train the model on 3k hours of video data. The resulting model shows improvements in both video quality and duration. For further details, please refer to our report.

Release v1.0.0

09 Apr 06:43
  • Added text-conditional control to generate videos.
  • Added support for HUAWEI NPU in the hw branch.
  • Released all training data and annotations.
  • Added training and sampling scripts.
  • Added CausalVideoVAE training details.

We trained all models on 40K videos crawled from the web, most of which are landscape-related content. The complete training process takes about 2,048 GPU hours. More detailed changes can be found in our report.

We hope this release further benefits the community and makes text-to-video models more accessible.