Ziyun Zeng*, Yuying Ge*, Xihui Liu, Bin Chen, Ping Luo, Shu-Tao Xia, Yixiao Ge
This repo is the official implementation of the paper Learning Transferable Spatiotemporal Representations from Natural Script Knowledge.
Before you start, run the following command to set up your Python environment:

```bash
pip install -r requirement.txt
```
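Optionally, you can sanity-check the environment before launching any jobs; the snippet below is illustrative and assumes `requirement.txt` installs a CUDA-enabled PyTorch build.

```python
# Optional sanity check: confirm that PyTorch is installed and can see the GPUs.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("Visible GPUs:", torch.cuda.device_count())
```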
- Download YT-Temporal from here, and put the dataset under the folder `data/YTTemporal`.
- Download WebVid-2M from here, and put the dataset under the folder `data/WebVid`.
- Download CC3M from here, and put the dataset under the folder `data/CC3M`.
- Download the split file from here, and unzip it in the root directory.
- Download SSV2 from here, and put the dataset under the folder `data/SSV2`.
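For reference, here is a minimal sketch that checks the dataset layout described above; it only verifies the four dataset folders, since the exact layout of the unzipped split files is not specified here.

```python
# Check that the dataset folders described above exist under the repo root.
from pathlib import Path

EXPECTED_DIRS = ["data/YTTemporal", "data/WebVid", "data/CC3M", "data/SSV2"]

for d in EXPECTED_DIRS:
    status = "ok" if Path(d).is_dir() else "MISSING"
    print(f"{d}: {status}")
```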
We use 32 NVIDIA V100 GPUs for pre-training and downstream evaluation. The detailed hyper-parameters can be found in the Appendix.
- Run the following script to pre-train the model on the YT-Temporal dataset. You need to download the official ImageMAE-Base weights for initialization.

  ```bash
  bash scripts/train_yt.sh
  ```
- Run the following script to jointly post-pretrain the model on the CC3M and WebVid-2M datasets. Note that you need to set the variable `load_checkpoint` in `configs/dist-cc-web-pt.json` to the checkpoint path of the YT-Temporal pre-trained model (see the sketch after this list).

  ```bash
  bash scripts/train_cc_web.sh
  ```
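For convenience, here is a minimal sketch of how to point `load_checkpoint` at the YT-Temporal checkpoint programmatically; it assumes `load_checkpoint` is a top-level key in the config, and the checkpoint filename is a placeholder. Editing `configs/dist-cc-web-pt.json` by hand works just as well.

```python
# Set "load_checkpoint" in configs/dist-cc-web-pt.json to the YT-Temporal checkpoint.
# Assumes "load_checkpoint" is a top-level key; the path below is a placeholder.
import json

config_path = "configs/dist-cc-web-pt.json"
with open(config_path) as f:
    config = json.load(f)

config["load_checkpoint"] = "checkpoints/yt_temporal_pretrained.pth"  # replace with your path

with open(config_path, "w") as f:
    json.dump(config, f, indent=4)
```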
We have released our pre-trained models on Google Drive at the following links so that you can quickly reproduce the results reported in our paper.
- YT-Temporal: https://drive.google.com/file/d/1JthEHg1ETHp5phHzjuhR1H8SfBlousYD/view?usp=sharing
- YT-Temporal + CC3M + WebVid-2M: https://drive.google.com/file/d/19WOHhJZfDtqLvzK_g_Kwr6om1hoMowMe/view?usp=sharing
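If you prefer downloading from the command line, the sketch below uses the third-party `gdown` package (installing it is an extra step not covered by `requirement.txt`); the share URLs are taken from the links above, and the output filenames are placeholders.

```python
# Download the released checkpoints from Google Drive (pip install gdown).
# The output filenames are placeholders; rename them as you like.
import gdown

gdown.download(
    "https://drive.google.com/file/d/1JthEHg1ETHp5phHzjuhR1H8SfBlousYD/view?usp=sharing",
    "yt_temporal_pretrained.pth",  # placeholder filename
    fuzzy=True,  # let gdown parse the full share URL
)
gdown.download(
    "https://drive.google.com/file/d/19WOHhJZfDtqLvzK_g_Kwr6om1hoMowMe/view?usp=sharing",
    "yt_cc_webvid_pretrained.pth",  # placeholder filename
    fuzzy=True,
)
```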
Run the following scripts to evaluate the pre-trained model on different tasks on the SSV2 dataset.

- Zero-shot Video Retrieval (only supports single-GPU evaluation currently; a sketch of the retrieval metric follows this list)

  ```bash
  bash scripts/zero_ssv2.sh
  ```

- Linear Probe (about 7-8 hours on 32 NVIDIA V100 GPUs)

  ```bash
  bash scripts/linear_ssv2.sh
  ```

- Fine-tuning (about 7-8 hours on 32 NVIDIA V100 GPUs)

  ```bash
  bash scripts/ft_ssv2.sh
  ```
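As background on the zero-shot retrieval protocol, the sketch below shows how recall@K is typically computed from paired video and text embeddings. The function and variable names are illustrative and do not correspond to files in this repo.

```python
# Illustrative recall@K for zero-shot video-text retrieval.
# Assumes video_embs and text_embs are L2-normalized tensors of shape (N, dim),
# where text i is the ground-truth caption for video i.
import torch

def recall_at_k(video_embs: torch.Tensor, text_embs: torch.Tensor, k: int = 5) -> float:
    sims = video_embs @ text_embs.T               # cosine similarities (embeddings are normalized)
    topk = sims.topk(k, dim=1).indices            # top-k most similar texts per video
    targets = torch.arange(video_embs.size(0)).unsqueeze(1)
    hits = (topk == targets).any(dim=1).float()   # 1 if the ground-truth text is ranked in the top-k
    return hits.mean().item()
```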
- The pre-training code is based on the official implementation of Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval.
- The downstream evaluation code is based on the official implementation of VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training.
If you find our work helpful, please cite our paper.
```bibtex
@InProceedings{Zeng_2023_CVPR,
    author    = {Zeng, Ziyun and Ge, Yuying and Liu, Xihui and Chen, Bin and Luo, Ping and Xia, Shu-Tao and Ge, Yixiao},
    title     = {Learning Transferable Spatiotemporal Representations From Natural Script Knowledge},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {23079-23089}
}
```
This repo makes use of some open-source projects, and credits are given to these projects. See License.txt for details.