This repository contains the dataset, codebase, and benchmarks for our paper "CNVid-3.5M: Build, Filter, and Pre-train the Large-scale Public Chinese Video-text Dataset", which has been accepted by CVPR 2023.
- The codebase is now available at AntMMF-CNVid-VTP.
- MediaCrawler is another project that may help you download the original videos in the CNVid-3.5M dataset.
- If you have any questions about CNVid-3.5M, please raise them at the AntMMF project.
Please first check TERMS.md, LEGAL.md, and LICENSE. Do not use the content of this dataset if you do not agree to the terms, legal disclaimer, and license outlined in these files.
Note that we do not own the copyright to any of the collected data. The distribution of identities and activities in the CNVid-3.5M dataset may not be representative of the global human population and the diversity of society. Please be mindful of unintended societal, gender, racial, and other biases when training or deploying models on this data.
CNVid-3.5M is a large-scale public cross-modal dataset containing over 3.5 million Chinese video-text pairs. We summarize our contributions with three verbs, i.e., "Build", "Filter", and "Pre-train": 1) To build a public Chinese video-text dataset, we collect over 4.5M videos from Chinese websites. 2) To improve the data quality, we propose a novel method to filter out 1M weakly-paired videos, resulting in the CNVid-3.5M dataset. 3) To demonstrate the value of the dataset, we pre-train video-text models on CNVid-3.5M and provide the corresponding benchmarks.
Check DATASET.md for instructions on downloading and preprocessing the CNVid-3.5M dataset.
Check BENCHMARK.md for instructions on downloading the CNVid-3.5M benchmarks and fine-tuning models.
The benchmarks are already prepared, but we still need some time to obtain external disclosure authorization from our group. All benchmarks are planned for release in January 2024.
If you find CNVid-3.5M useful, please consider citing the following paper:
@inproceedings{gan2023cnvid,
  title={CNVid-3.5M: Build, Filter, and Pre-Train the Large-Scale Public Chinese Video-Text Dataset},
  author={Gan, Tian and Wang, Qing and Dong, Xingning and Ren, Xiangyuan and Nie, Liqiang and Guo, Qingpei},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={14815--14824},
  year={2023}
}