Ziyun Zeng*, Yuying Ge*, Xihui Liu, Bin Chen, Ping Luo, Shu-Tao Xia, Yixiao Ge
This repo is the official implementation of the paper Learning Transferable Spatiotemporal Representations from Natural Script Knowledge.
Before you start, run the following command to set up your Python environment:

```bash
pip install -r requirement.txt
```
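Optionally, you can sanity-check the environment before launching any jobs; the snippet below is illustrative and assumes `requirement.txt` installs a CUDA-enabled PyTorch build.

```python
# Optional sanity check: confirm that PyTorch is installed and can see the GPUs.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("Visible GPUs:", torch.cuda.device_count())
```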
- Download YT-Temporal from here, and put the dataset under the folder `data/YTTemporal`.
- Download WebVid-2M from here, and put the dataset under the folder `data/WebVid`.
- Download CC3M from here, and put the dataset under the folder `data/CC3M`.
- Download the split file from here, and unzip it in the root directory.
- Download SSV2 from here, and put the dataset under the folder `data/SSV2`.
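For reference, here is a minimal sketch that checks the dataset layout described above; it only verifies the four dataset folders, since the exact layout of the unzipped split files is not specified here.

```python
# Check that the dataset folders described above exist under the repo root.
from pathlib import Path

EXPECTED_DIRS = ["data/YTTemporal", "data/WebVid", "data/CC3M", "data/SSV2"]

for d in EXPECTED_DIRS:
    status = "ok" if Path(d).is_dir() else "MISSING"
    print(f"{d}: {status}")
```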
We use 32 NVIDIA V100 GPUs for pre-training and downstream evaluation. The detailed hyper-parameters can be found in the Appendix.
- Run the following script to pre-train the model on the YT-Temporal dataset. You need to download the official ImageMAE-Base weights for initialization.

  ```bash
  bash scripts/train_yt.sh
  ```
- Run the following script to jointly post-pretrain the model on the CC3M and WebVid-2M datasets. Note that you need to set the variable `load_checkpoint` in `configs/dist-cc-web-pt.json` to the checkpoint path of the YT-Temporal pre-trained model (see the sketch after this list).

  ```bash
  bash scripts/train_cc_web.sh
  ```
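For convenience, here is a minimal sketch of how to point `load_checkpoint` at the YT-Temporal checkpoint programmatically; it assumes `load_checkpoint` is a top-level key in the config, and the checkpoint filename is a placeholder. Editing `configs/dist-cc-web-pt.json` by hand works just as well.

```python
# Set "load_checkpoint" in configs/dist-cc-web-pt.json to the YT-Temporal checkpoint.
# Assumes "load_checkpoint" is a top-level key; the path below is a placeholder.
import json

config_path = "configs/dist-cc-web-pt.json"
with open(config_path) as f:
    config = json.load(f)

config["load_checkpoint"] = "checkpoints/yt_temporal_pretrained.pth"  # replace with your path

with open(config_path, "w") as f:
    json.dump(config, f, indent=4)
```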
We have released our pre-trained models on Google Drive at the following links so that you can quickly reproduce the results reported in our paper.
- YT-Temporal: https://drive.google.com/file/d/1JthEHg1ETHp5phHzjuhR1H8SfBlousYD/view?usp=sharing
- YT-Temporal + CC3M + WebVid-2M: https://drive.google.com/file/d/19WOHhJZfDtqLvzK_g_Kwr6om1hoMowMe/view?usp=sharing
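If you prefer downloading from the command line, the sketch below uses the third-party `gdown` package (installing it is an extra step not covered by `requirement.txt`); the share URLs are taken from the links above, and the output filenames are placeholders.

```python
# Download the released checkpoints from Google Drive (pip install gdown).
# The output filenames are placeholders; rename them as you like.
import gdown

gdown.download(
    "https://drive.google.com/file/d/1JthEHg1ETHp5phHzjuhR1H8SfBlousYD/view?usp=sharing",
    "yt_temporal_pretrained.pth",  # placeholder filename
    fuzzy=True,  # let gdown parse the full share URL
)
gdown.download(
    "https://drive.google.com/file/d/19WOHhJZfDtqLvzK_g_Kwr6om1hoMowMe/view?usp=sharing",
    "yt_cc_webvid_pretrained.pth",  # placeholder filename
    fuzzy=True,
)
```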
Run the following scripts to evaluate the pre-trained model on different tasks on the SSV2 dataset.

- Zero-shot Video Retrieval (only supports single-GPU evaluation currently; a sketch of the retrieval metric follows this list)

  ```bash
  bash scripts/zero_ssv2.sh
  ```

- Linear Probe (about 7-8 hours on 32 NVIDIA V100 GPUs)

  ```bash
  bash scripts/linear_ssv2.sh
  ```

- Fine-tuning (about 7-8 hours on 32 NVIDIA V100 GPUs)

  ```bash
  bash scripts/ft_ssv2.sh
  ```
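As background on the zero-shot retrieval protocol, the sketch below shows how recall@K is typically computed from paired video and text embeddings. The function and variable names are illustrative and do not correspond to files in this repo.

```python
# Illustrative recall@K for zero-shot video-text retrieval.
# Assumes video_embs and text_embs are L2-normalized tensors of shape (N, dim),
# where text i is the ground-truth caption for video i.
import torch

def recall_at_k(video_embs: torch.Tensor, text_embs: torch.Tensor, k: int = 5) -> float:
    sims = video_embs @ text_embs.T               # cosine similarities (embeddings are normalized)
    topk = sims.topk(k, dim=1).indices            # top-k most similar texts per video
    targets = torch.arange(video_embs.size(0)).unsqueeze(1)
    hits = (topk == targets).any(dim=1).float()   # 1 if the ground-truth text is ranked in the top-k
    return hits.mean().item()
```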
- The pre-training code is based on the official implementation of Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval.
- The downstream evaluation code is based on the official implementation of VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training.
If you find our work helpful, please cite our paper.
```bibtex
@InProceedings{Zeng_2023_CVPR,
    author    = {Zeng, Ziyun and Ge, Yuying and Liu, Xihui and Chen, Bin and Luo, Ping and Xia, Shu-Tao and Ge, Yixiao},
    title     = {Learning Transferable Spatiotemporal Representations From Natural Script Knowledge},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {23079-23089}
}
```
This repo makes use of some open-source projects, and credits are given to these projects. See License.txt for details.