- CLIP 🚀, an improved recipe for training CLIP
- SiamLIP, BYOLIP, BarLIP, and SwALIP, non-contrastive VLP baselines extending CLIP and inspired by SimSiam, BYOL, Barlow Twins, and SwAV
- `repro_exps.txt`: scripts to reproduce the experiments of our paper, *Improved baselines for vision-language pre-training*
The code has been tested with CUDA 11.3/cuDNN 8.3.2, PyTorch 1.12.1, and timm 0.6.11.
For a minimal environment, use `conda env create -f clip_rocket_env.yaml` and optionally install wandb via pip.
conda:
- python=3.9
- pytorch=1.12.1=py3.9_cuda11.3_cudnn8.3.2_0 -c pytorch
- torchvision=0.13.1=py39_cu113 -c pytorch
pip:
- timm==0.6.11
- xformers==0.0.14.dev315+git.e23b369=py39_cu11.3_pyt1.12.1
- flash-attn==0.1
- textaugment==1.3.4
- nltk==3.7
- [optional] wandb
Download the YFCC100M dataset.
Our dataloader expects the following dataset directory structure, with 100 folders containing 1000 zip archives of 1000 images each.
The concatenation of the folder, archive, and file names is the index of the image (i.e. image 12345678 is stored as `678.jpg` within `12/345.zip`; see the path-mapping sketch after the tree below):
```
/path/to/yfcc100m/
├── images/
│   ├── 00/
│   │   ├── 000.zip
│   │   │   ├── 000.jpg
│   │   │   ├── ...
│   │   │   └── 999.jpg
│   │   ├── ...
│   │   └── 999.zip
│   ├── ...
│   └── 99/
└── ...
```
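The following is a minimal sketch (not part of the repo) of how an image index maps to a file under this layout, assuming indices are zero-padded to 8 digits as in the example above:

```python
# Sketch only: map a YFCC100M image index to its zip archive and member name,
# e.g. 12345678 -> images/12/345.zip : 678.jpg (matching the layout above).
import zipfile
from pathlib import Path

def yfcc_image_location(index: int, root: str = "/path/to/yfcc100m"):
    key = f"{index:08d}"                                 # zero-pad to 8 digits
    folder, archive, name = key[:2], key[2:5], key[5:]   # "12", "345", "678"
    return Path(root) / "images" / folder / f"{archive}.zip", f"{name}.jpg"

# Example: read the raw bytes of image 12345678
zip_path, member = yfcc_image_location(12345678)
with zipfile.ZipFile(zip_path) as zf:
    data = zf.read(member)
```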
Prepare the YFCC15M subset metadata pickle:
- Download and compile a list of downloaded images to `flickr_unique_ids.npy` (ours)
- Download OpenAI's list of captioned YFCC100M images according to the instructions here
- Run `python make_dataset.py` to create the `yfcc15m.pkl` metadata pickle (a quick sanity check is sketched below)
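Optionally, a quick sanity check (a sketch only; the internal structure of the pickle is defined by `make_dataset.py`, so this just verifies that the file loads and counts its entries):

```python
# Sketch only: confirm the generated metadata pickle can be unpickled and count
# its entries; for the YFCC15M subset this should be on the order of 15M.
import pickle

with open('/path/to/yfcc15m.pkl', 'rb') as f:
    metadata = pickle.load(f)

print(type(metadata), len(metadata))
```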
When pre-training with YFCC15M, set `--dataset yfcc15m --root /path/to/yfcc100m --metadata /path/to/yfcc15m.pkl`.
CC3M and CC12M are published as TSV files listing the original image URLs and processed captions.
Download the images and collect the captions of all available images (many will be missing due to broken links) into `cc3m.npy` and `cc12m.npy`.
For CC3M, our dataloader expects `cc3m.npy` to contain a NumPy array of dicts in the following format:
```python
{
    'image_id': 1510438788,  # local file path relative to root
    'captions': ['large field with pink tulips on a clear sunny summer day with a blue sky']
}
```
For CC12M, our dataloader expects `cc12m.npy` to contain a NumPy array of dicts in the following format:
```python
{
    'image_name': '0.jpg',  # local file path relative to root
    'image_id': 0,
    'captions': ['Metal Design Within Reach Ivory Slipper Chairs - a Pair For Sale - Image 7 of 10']
}
```
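As an illustration of how such a metadata file might be assembled (a sketch under our own assumptions; the download step and the actual records are up to you), one could do something like:

```python
# Sketch only: build the CC12M metadata array in the format described above and
# save it with NumPy. In practice, `records` would be filled while downloading
# images from the TSV file, skipping entries whose links are broken.
import numpy as np

records = [
    {
        'image_name': '0.jpg',   # local file path relative to --root
        'image_id': 0,
        'captions': ['Metal Design Within Reach Ivory Slipper Chairs - a Pair For Sale - Image 7 of 10'],
    },
    # ... one dict per successfully downloaded image
]

np.save('cc12m.npy', np.array(records, dtype=object))

# The dataloader-side format can then be spot-checked with:
meta = np.load('cc12m.npy', allow_pickle=True)
print(meta[0]['captions'][0])
```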
When pre-training on CC3M, set `--dataset cc3m --root /path/to/cc3m --metadata /path/to/cc3m.npy`, and when pre-training on CC12M, set `--dataset cc12m --root /path/to/cc12m --metadata /path/to/cc12m.npy`.
Zero-shot evaluations (in `main.py` and `eval_zeroshot.py`) read dataset paths from `dataset_catalog.json`, and read CLIP's class labels and caption templates from `labels.json` and `templates.json`. If just pre-training models on YFCC15M, only the ImageNet path is required for model validation between training epochs. See Section 3 below on zero-shot transfer evaluation for dataset preparation details.
We use the following pre-training recipes for CLIP 🚀 and the other improved baselines. Note that in our code the model class needed for the improved recipe is marked as `CL2L`.
See `main.py` for the full list of default arguments.
The different models can be selected by passing different strings to the `--model` argument, such as `CL2L_R50_CL2L`, `CL2L_R50_BARLIP`, or `CL2L_VITB16_CL2L`. As can be noted, the string is composed of three substrings, `<base model>_<vision bbone>_<model name>` (see the short example after the list below):
- `<base model>` can be either `CLIP` or `CL2L`, and defines the base class of the model; the latter extends the former with support for multiple augmentations and architectural improvements such as multiple projectors. For this reason, `CL2L` can emulate the baseline `CLIP` by setting `--num-augs 0`.
- `<vision bbone>` defines the vision encoder and can be the name of any vision backbone class implemented in models.py.
- `<model name>` defines the actual model we are going to train. Supported choices are defined in `get_model()`.
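For concreteness, composing one of the strings mentioned above is plain string concatenation (illustrative only; which combinations are actually supported is determined by `get_model()`):

```python
# Illustrative only: compose a --model string from its three parts.
base_model = "CL2L"    # "CLIP" or "CL2L"
vision_bbone = "R50"   # name of a vision backbone class from models.py
model_name = "BARLIP"  # one of the choices handled by get_model()

model_arg = f"{base_model}_{vision_bbone}_{model_name}"
print(model_arg)  # -> "CL2L_R50_BARLIP", passed as --model CL2L_R50_BARLIP
```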
In our workflow we use submitit, which interfaces nicely with Slurm.
For local training with the torchrun utility (which supersedes `torch.distributed.launch`), replace `python run_with_submitit.py` with `torchrun --nproc_per_node=8 main.py`.
Local multi-node training with `torchrun` should also be possible. `run.sh` provides a convenient wrapper to robustly run experiments based on the principle of one commit → one experiment.
We train most of our models on 4x 8-GPU nodes, but training with fewer GPUs is possible by setting the `--update-freq` argument above 1 to enable gradient accumulation, or by using `--checkpoint-grad`, which reduces memory usage.
Note that gradient accumulation will increase the variance of minibatch statistics and alter the training dynamics of batchnorm.
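A rough sketch of how one might pick `--update-freq` to preserve the effective batch size of the reference command below (assuming `--batch-size` is the per-GPU batch size and gradients are accumulated over `--update-freq` iterations):

```python
# Sketch only: reference setup is 4 nodes x 8 GPUs with --batch-size 128.
reference_global_batch = 4 * 8 * 128                 # 4096 samples per optimizer step

nodes, gpus_per_node, per_gpu_batch = 1, 8, 128      # e.g. a single 8-GPU node
update_freq = reference_global_batch // (nodes * gpus_per_node * per_gpu_batch)
print(update_freq)                                   # -> 4, i.e. pass --update-freq 4
```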
```bash
bash run.sh run_with_submitit.py \
    --model CL2L_R50_CL2L \
    --dataset yfcc15m \
    --name CL2L_R50_CLIP-YFCC \
    --separate-proj \
    --text-augment \
    --clean-before-augment \
    --loss-avg-or-sum sum \
    --label-smoothing 0.1 \
    --epochs 32 \
    --nodes 4 \
    --batch-size 128 \
```
First, prepare additional downstream classification datasets:
- MNIST, CIFAR-10/100, STL-10: Automatic download via torchvision datasets
- HatefulMemes: Manual download from the official website; sort images according to `train.jsonl`/`dev.jsonl` into train/dev folders
- Rendered SST2, Country211: Manual download from the CLIP repo
- Other datasets: Use scripts from VISSL
Then set all dataset paths in `dataset_catalog.json`.
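Before launching evaluation, it can help to sanity-check the catalog paths. The exact schema of each entry is defined by the `dataset_catalog.json` shipped with this repo; the sketch below only assumes each value contains a local path somewhere:

```python
# Sketch only: warn about dataset entries whose paths do not exist on disk.
import json
import os

with open("dataset_catalog.json") as f:
    catalog = json.load(f)

for name, entry in catalog.items():
    # entries may be dicts or plain strings depending on the catalog schema
    path = entry.get("path") if isinstance(entry, dict) else entry
    if isinstance(path, str) and not os.path.exists(path):
        print(f"[warning] dataset '{name}' points to a missing path: {path}")
```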
Evaluate zero-shot transfer to various classification benchmarks with `eval_zeroshot.py`, which reads labels and templates from `labels.json`/`templates.json` and dataset paths from `dataset_catalog.json`. Inference is performed with a single GPU. By default, the script iterates through all datasets in `dataset_catalog.json` and evaluates each of them zero-shot in order. Evaluation can be limited to a subset of datasets by replacing `for d in datasets:` with `for d in ['imagenet']:` on line 78.
```bash
python eval_zeroshot.py --output-dir /path/to/experiment --model-name clip-rocket
```
This repo is mostly based on the SLIP repo. We also adapted some code from the CLIP repo and the timm repo. We thank the authors of these repos for their great contributions to the community.
See the CONTRIBUTING file for how to help out.
The majority of `clip-rocket` is licensed under CC-BY-NC; however, portions of the project are available under separate license terms: https://github.com/facebookresearch/SLIP and https://github.com/openai/CLIP are licensed under the MIT license, and https://github.com/rwightman/timm is licensed under the Apache-2.0 license. See LICENSE for details.
```bibtex
@article{
    fini2023improved,
    title={Improved baselines for vision-language pre-training},
    author={Enrico Fini and Pietro Astolfi and Adriana Romero-Soriano and Jakob Verbeek and Michal Drozdzal},
    journal={Transactions on Machine Learning Research},
    issn={2835-8856},
    year={2023},
    url={https://openreview.net/forum?id=a7nvXxNmdV},
    note={Featured Certification}
}
```