Releases: PKU-YuanGroup/Open-Sora-Plan
Release v1.3.1
Release v1.3.0
In version 1.3.0, Open-Sora-Plan introduced the following five key features:
- A more powerful and cost-efficient WFVAE. We decompose the video into several sub-bands using wavelet transforms, naturally capturing information across different frequency domains and leading to more efficient and robust VAE learning (a minimal decomposition sketch is given below).
- Prompt Refiner. A large language model designed to refine short text inputs.
- High-quality data cleaning strategy. The cleaned Panda70M dataset retains only 27% of the original data.
- DiT with new sparse attention. A more cost-effective and efficient learning approach.
- Dynamic resolution and dynamic duration. This enables more efficient utilization of videos with varying lengths (treating a single frame as an image).
For further details, please refer to our report.
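For readers curious what a wavelet-based sub-band split looks like in practice, below is a minimal sketch of a single-level 3D Haar transform that separates a video tensor into eight frequency sub-bands. It illustrates the general technique only; it is not the WFVAE implementation, and the tensor layout and normalization are assumptions.

```python
# A minimal sketch (not the official WFVAE code) of a single-level 3D Haar
# wavelet transform that splits a video tensor into 8 frequency sub-bands.
# Tensor layout (B, C, T, H, W) and orthonormal scaling are assumptions.
import torch


def haar_split(x: torch.Tensor, dim: int):
    """Split a tensor into low/high Haar sub-bands along one dimension."""
    even = x.index_select(dim, torch.arange(0, x.size(dim), 2, device=x.device))
    odd = x.index_select(dim, torch.arange(1, x.size(dim), 2, device=x.device))
    low = (even + odd) / 2 ** 0.5   # averaging  -> low-frequency band
    high = (even - odd) / 2 ** 0.5  # differencing -> high-frequency band
    return low, high


def haar_3d(video: torch.Tensor):
    """video: (B, C, T, H, W) with even T, H, W. Returns 8 sub-bands, each of
    shape (B, C, T/2, H/2, W/2), covering every combination of low/high
    frequencies along time, height and width."""
    bands = [video]
    for dim in (2, 3, 4):  # time, height, width
        bands = [b for band in bands for b in haar_split(band, dim)]
    return bands  # [LLL, LLH, LHL, LHH, HLL, HLH, HHL, HHH]


if __name__ == "__main__":
    x = torch.randn(1, 3, 16, 64, 64)
    subbands = haar_3d(x)
    print([tuple(b.shape) for b in subbands])  # 8 x (1, 3, 8, 32, 32)
```

Each sub-band carries a distinct frequency range of the video, which is the property the report attributes to more efficient and robust VAE learning.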
COMING SOON
⚡️⚡️⚡️ For large-model parallel training, TP, SP, and more strategies are coming... A Huawei Ascend multimodal MindSpeed-MM branch will be added soon: backed by the Huawei MindSpeed-MM suite, it will support scaling up Open-Sora-Plan's parameter count and provide TP, SP, and other distributed training capabilities for models at larger parameter scales.
Release v1.2.0
v1.2.0 is here! It uses a 3D full attention architecture instead of 2+1D. We released a true 3D video diffusion model trained on 4s 720p videos.
- Shifted from the 2+1D model to a 3D full attention architecture; 2+1D is no longer supported.
- Instead of joint image-video training, image weights are trained first and used as the initialization for video training.
- Released all data annotations; the data are filtered by aesthetics and motion.
- Improved CausalVideoVAE performance and reported results on the validation sets of WebVid and Panda70M.
Although the 3D attention architecture excels in spatio-temporal consistency, it is so expensive to train that it is difficult to scale up. We hope to collaborate with the open-source community to optimize the 3D DiT architecture. For further details, please refer to our report.
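To make the architectural shift concrete, the sketch below contrasts factorized 2+1D attention (spatial attention within each frame, then temporal attention per spatial location) with 3D full attention over one joint space-time token sequence. It is a simplified single-head illustration, not the Open-Sora-Plan DiT block.

```python
# A minimal sketch contrasting 2+1D attention with 3D full attention.
# Illustrative only; real DiT blocks add projections, heads, norms, etc.
import torch
import torch.nn.functional as F


def attn(x):
    """Single-head self-attention over the sequence dim. x: (batch, seq, dim)."""
    return F.scaled_dot_product_attention(x, x, x)


def spatial_temporal_attention(x):
    """2+1D: attend within each frame, then along time for each spatial token."""
    b, t, s, d = x.shape                                  # (batch, frames, tokens/frame, dim)
    x = attn(x.reshape(b * t, s, d)).reshape(b, t, s, d)  # spatial pass
    x = x.transpose(1, 2).reshape(b * s, t, d)            # regroup by spatial token
    x = attn(x).reshape(b, s, t, d).transpose(1, 2)       # temporal pass
    return x


def full_3d_attention(x):
    """3D full attention: one joint sequence of all space-time tokens."""
    b, t, s, d = x.shape
    return attn(x.reshape(b, t * s, d)).reshape(b, t, s, d)


if __name__ == "__main__":
    tokens = torch.randn(2, 8, 16, 64)  # 8 frames x 16 spatial tokens each
    print(spatial_temporal_attention(tokens).shape)  # torch.Size([2, 8, 16, 64])
    print(full_3d_attention(tokens).shape)           # torch.Size([2, 8, 16, 64])
```

The 3D variant attends over all T×S tokens at once, which is why its cost grows much faster with duration and resolution than the factorized form, matching the scaling concern noted above.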
Release v1.1.0
- Support for longer videos, dynamic resolution training and inference (a bucketing sketch is given below).
- Support for Ascend training and inference.
- Released all training data and annotations.
- Improved CausalVideoVAE performance.
In this version, we employed ShareGPT4Video for video annotation and then trained the model on 3k hours of video data. The resulting model shows improvements in both video quality and duration. For further details, please refer to our report.
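One common way to implement dynamic-resolution, dynamic-duration training is bucketed batching: samples are grouped by shape so that each batch stacks into a single tensor, and a 1-frame sample is handled exactly like an image. The sketch below illustrates that idea under assumed field names; it is not the project's actual data pipeline.

```python
# A minimal, illustrative sketch of bucketed batching for mixed-resolution,
# mixed-duration training. The sample dict keys and the flushing strategy
# are assumptions, not the Open-Sora-Plan dataloader.
import random
from collections import defaultdict


def bucketed_batches(samples, batch_size):
    """samples: iterable of dicts with 'height', 'width', 'frames' keys.
    Yields lists of samples whose spatial size and duration match."""
    buckets = defaultdict(list)
    for sample in samples:
        key = (sample["height"], sample["width"], sample["frames"])
        buckets[key].append(sample)
        if len(buckets[key]) == batch_size:
            yield buckets.pop(key)      # emit a full, shape-uniform batch
    for leftover in buckets.values():   # flush incomplete buckets at the end
        yield leftover


if __name__ == "__main__":
    data = [{"height": random.choice([240, 480]),
             "width": 640,
             "frames": random.choice([1, 29, 93])}  # 1 frame == an image
            for _ in range(20)]
    for batch in bucketed_batches(data, batch_size=4):
        print(len(batch), batch[0]["height"], batch[0]["frames"])
```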
Release v1.0.0
- Added text-conditional control for video generation.
- Added Huawei NPU support in the hw branch.
- Released all training data and annotations.
- Added training and sampling scripts.
- Added CausalVideoVAE training details (a causal-convolution sketch follows this list).
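The "causal" in CausalVideoVAE refers to temporal causality: convolutions only look at past frames, so a single frame can be encoded like a plain image. Below is a minimal sketch of a causal 3D convolution in that spirit; the padding choice and module layout are assumptions, not the repository's implementation.

```python
# A minimal sketch (assumption, not the repository's code) of a causal 3D
# convolution: the temporal axis is padded only on the past side, so frame t
# never sees future frames, and a 1-frame input behaves like a 2D image conv.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalConv3d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel=(3, 3, 3)):
        super().__init__()
        self.time_pad = kernel[0] - 1                    # pad past frames only
        space_pad = (kernel[1] // 2, kernel[2] // 2)     # symmetric spatial padding
        self.conv = nn.Conv3d(in_ch, out_ch, kernel, padding=(0, *space_pad))

    def forward(self, x):                                # x: (B, C, T, H, W)
        # F.pad order for 5D tensors: (W_left, W_right, H_left, H_right, T_left, T_right)
        x = F.pad(x, (0, 0, 0, 0, self.time_pad, 0))
        return self.conv(x)


if __name__ == "__main__":
    conv = CausalConv3d(3, 8)
    video = torch.randn(1, 3, 9, 32, 32)
    image = torch.randn(1, 3, 1, 32, 32)   # a single frame is just an image
    print(conv(video).shape)               # torch.Size([1, 8, 9, 32, 32])
    print(conv(image).shape)               # torch.Size([1, 8, 1, 32, 32])
```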
We trained all models on 40K videos crawled from the web, most of which are landscape-related content. The complete training process took about 2,048 GPU hours. More detailed changes can be found in our report.
We hope this release further benefits the community and makes text-to-video models more accessible.