This repository contains PyTorch implementation of our paper Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning (CVPR 2020).
Python 3 and PyTorch 1.3.
# clone the repository
git clone [email protected]:cshizhe/hgr_v2t.git
cd hgr_v2t
export PYTHONPATH=$(pwd):${PYTHONPATH}
We provide annotations, pretrained features on MSRVTT, TGIF, VATEX and Youtube2Text video captioning datasets, which can be downloaded from BaiduNetdisk (code: vxpi).
- groundtruth: annotation/RET directory
- ref_captions.json: dict, {videoname: [sent]}
- sent2rolegraph.augment.json: {sent: (graph_nodes, graph_edges)}
-
vocabularies: annotation/RET directory int2word.npy: [word] word2int.json: {word: int}
-
data splits: public_split directory trn_names.npy, val_names.npy, tst_names.npy
For MSRVTT, TGIF and Youtube2Text datasets, we extract features with Resnet152 pretrained on ImageNet. For VATEX dataset, we use the I3D features released by VATEX challenge organizers.
- mean pooling features: ordered_feature/MP directory
format: np array, shape=(num_fts, dim_ft) corresponding to the order in data_split names
- frame-level features: ordered_feature/SA directory
format: hdf5 file, {name: ft}, ft.shape=(num_frames, dim_ft)
We construct the fine-grained binary selection dataset based on the testing set of Youtube2Text dataset. The annotations are in the Youtube2Text/annotation/binary_selection directory.
We provided constructed role graph annotations. If you want to generate role graphs for new datasets, please follow the following instructions.
- semantic role labeling:
python misc/semantic_role_labeling.py ref_caption_file out_file --cuda_device 0
- convert sentence into role graph:
cd misc
jupyter notebook
# open parse_sent_to_role_graph.ipynd
- The baseline VSE++ model:
cd t2vretrieval/driver
# setup config files
# you should modify data paths in configs/prepare_globalmatch_configs.py
python configs/prepare_globalmatch_cofig.py $datadir
resdir='' # copy the output string of the previous step
# training
python global_match.py $resdir/model.json $resdir/path.json --is_train --resume_file $resdir/../../word_embeds.glove42b.th
# inference
python global_match.py $resdir/model.json $resdir/path.json --eval_set tst
- Our HGR model:
cd t2vretrieval/driver
# setup config files
# you should modify data paths in configs/prepare_mlmatch_configs.py
python configs/prepare_mlmatch_configs.py $datadir
resdir='' # copy the output string of the previous step
# training
python multilevel_match.py $resdir/model.json $resdir/path.json --load_video_first --is_train --resume_file $resdir/../../word_embeds.glove42b.th
# inference
python multilevel_match.py $resdir/model.json $resdir/path.json --load_video_first --eval_set tst
If you use this code as part of any published research, we'd really appreciate it if you could cite the following paper:
@article{chen2020fine,
title={Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning},
author={Chen, Shizhe and Zhao, Yida and Jin, Qin and Wu, Qi},
journal={CVPR},
year={2020}
}
MIT License