We analyze the quality of the visual representations inside the LLM of the MLLM through a series of probing experiments against target representations from the OneFormer, Depth-Anything v2, and unCLIP-SD encoders for the seg, depth, and gen features, respectively.
Specifically, we attach a single-layer Perceiver Resampler probe at every LLM layer and train the probes for 2 epochs on the COCO-train2017 set with a smooth-L1 loss objective.
Note: We trained our probes on 16 MI300X GPUs with a total batch size of 256 per iteration.
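For intuition, here is a minimal sketch of such a probe, assuming PyTorch; the class name, latent count, and dimensions are illustrative, not the repo's actual configuration. A set of learned latent queries cross-attends to the LLM hidden states from one layer, and the output is regressed onto the frozen target features with smooth L1 loss.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResamplerProbe(nn.Module):
    """Single-layer Perceiver Resampler probe (illustrative sketch;
    dimensions are hypothetical, not the repo's actual config)."""
    def __init__(self, llm_dim=4096, probe_dim=1024, num_latents=64, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, probe_dim) * 0.02)
        self.proj_in = nn.Linear(llm_dim, probe_dim)
        self.cross_attn = nn.MultiheadAttention(probe_dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.LayerNorm(probe_dim),
            nn.Linear(probe_dim, 4 * probe_dim),
            nn.GELU(),
            nn.Linear(4 * probe_dim, probe_dim),
        )

    def forward(self, hidden_states):            # (B, seq_len, llm_dim)
        kv = self.proj_in(hidden_states)         # project LLM states to probe width
        q = self.latents.expand(hidden_states.size(0), -1, -1)
        out, _ = self.cross_attn(q, kv, kv)      # latents cross-attend to LLM states
        return out + self.ffn(out)               # (B, num_latents, probe_dim)

# training objective: smooth L1 between the probe output and frozen target
# features (OneFormer / Depth-Anything v2 / unCLIP-SD, depending on the probe)
def probe_loss(probe_out, target_feats):
    return F.smooth_l1_loss(probe_out, target_feats)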
# download annotations and images
cd datasets
mkdir -p coco && cd coco
wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip && unzip annotations_trainval2017.zip
wget http://images.cocodataset.org/zips/train2017.zip && unzip train2017.zip
wget http://images.cocodataset.org/zips/val2017.zip && unzip val2017.zip
cd ../..
# depth probing
bash scripts/probe/probe.sh depth
# seg probing
bash scripts/probe/probe.sh seg
# gen probing
bash scripts/probe/probe.sh gen
You can find the probe checkpoints from our paper on the HF Hub: [depth probes] | [seg probes] | [gen probes].
Evaluate the cosine similarity of the probe outputs from every LLM layer against the corresponding target representations on the COCO-val2017 images.
bash scripts/probe/eval_probe_cos_sim.sh shi-labs/OLA-VLM-CLIP-ConvNeXT-Llama3-8b depth
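The layer-wise score is the mean cosine similarity between each probe's output and the target features. A minimal sketch of the metric (tensor and variable names are hypothetical):

import torch
import torch.nn.functional as F

@torch.no_grad()
def layerwise_cos_sim(probe_outs, target):
    """probe_outs: dict mapping LLM layer index -> (B, N, D) probe outputs;
    target: (B, N, D) features from the frozen target encoder."""
    return {
        layer: F.cosine_similarity(out, target, dim=-1).mean().item()
        for layer, out in probe_outs.items()
    }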
Evaluate the probes on their target task (segmentation, depth estimation, or image generation) using the corresponding task-specific benchmarks.
For depth, we evaluate accuracy on the DA-2K benchmark.
# download DA-2K benchmark
git lfs install
cd datasets/eval/ && wget https://huggingface.co/datasets/depth-anything/DA-2K/resolve/main/DA-2K.zip && unzip DA-2K.zip && cd ../../
# evaluate
CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/probe/eval_probe_task.sh shi-labs/probe_depth_llava-1.5-pt-ift depth
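DA-2K scores relative depth: each sample annotates two points in an image, and a prediction counts as correct when the predicted depth map orders them the same way as the label. A minimal sketch of the metric; the annotation key names are assumed, not taken from the DA-2K release, so check the actual JSON format:

def da2k_accuracy(samples, predict_depth):
    """samples: iterable of dicts with pixel coords 'point1'/'point2' and a
    'closer_point' label (key names assumed);
    predict_depth: callable image_path -> HxW depth map (larger = farther;
    flip the comparison for disparity-style outputs)."""
    correct = 0
    for s in samples:
        depth = predict_depth(s["image"])
        (x1, y1), (x2, y2) = s["point1"], s["point2"]
        pred_closer = "point1" if depth[y1, x1] < depth[y2, x2] else "point2"
        correct += pred_closer == s["closer_point"]
    return correct / len(samples)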
For segmentation, we evaluate the mIoU on the COCO-val2017 set.
# prepare annotations
cd datasets/coco/annotations && wget http://images.cocodataset.org/annotations/panoptic_annotations_trainval2017.zip && unzip panoptic_annotations_trainval2017.zip && cd ../../..
python ola_vlm/eval/convert_pan_to_sem.py
# evaluate
CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/probe/eval_probe_task.sh shi-labs/probe_seg_llava-1.5-pt-ift seg
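For reference, mIoU is the per-class intersection-over-union averaged over classes, accumulated via a confusion matrix over the val set. A minimal sketch (not the repo's evaluator):

import numpy as np

def miou(preds, gts, num_classes, ignore_index=255):
    """preds/gts: iterables of HxW integer label maps with matching shapes."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    for pred, gt in zip(preds, gts):
        mask = gt != ignore_index
        # rows index ground-truth classes, columns index predicted classes
        conf += np.bincount(
            num_classes * gt[mask].astype(np.int64) + pred[mask].astype(np.int64),
            minlength=num_classes ** 2,
        ).reshape(num_classes, num_classes)
    inter = np.diag(conf)
    union = conf.sum(0) + conf.sum(1) - inter
    iou = inter / np.maximum(union, 1)
    return iou[union > 0].mean()   # average only over classes present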
For generation, we evaluate the FID on the COCO-val2017 set.
# evaluate
CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/probe/eval_probe_task.sh shi-labs/probe_gen_llava-1.5-pt-ift gen
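FID compares Inception feature statistics of the generated images against the real COCO-val2017 images. A minimal sketch using torchmetrics; our evaluation script may use a different FID implementation, and the dataloaders here are hypothetical:

import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)
# real_loader / fake_loader: hypothetical loaders yielding uint8 (N, 3, H, W) batches
for batch in real_loader:
    fid.update(batch, real=True)    # accumulate stats for real COCO images
for batch in fake_loader:
    fid.update(batch, real=False)   # accumulate stats for probe-generated images
print(f"FID: {fid.compute().item():.2f}")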