Official PyTorch implementation of ParGo: Bridging Vision-Language with Partial and Global Views (AAAI 2025).
cd ParGo
conda create -n ParGo_env python=3.10 -y
conda activate ParGo_env
pip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
pip install -r ./requirements.txt
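After installation, a quick sanity check (illustrative, not part of the repo) confirms that the CUDA 11.8 build of PyTorch is active:

```python
# Environment sanity check (illustrative; not part of the ParGo codebase).
import torch
import torchvision

print(torch.__version__)          # expect 2.0.1+cu118
print(torchvision.__version__)    # expect 0.15.2+cu118
print(torch.cuda.is_available())  # expect True on a CUDA 11.8 machine
```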
The LLM (internlm2-7b) and the vision encoder (eva-clip-l-14-336) need to be downloaded in advance.
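One way to fetch them is `snapshot_download` from `huggingface_hub`. This is a minimal sketch; the repo IDs and local paths below are assumptions, so substitute whichever hosted copies of the two checkpoints you actually use:

```python
# Hedged download sketch (not from the ParGo repo). The repo IDs and local
# paths are assumptions; replace them with the checkpoints you actually use.
from huggingface_hub import snapshot_download

llm_path = snapshot_download(
    repo_id="internlm/internlm2-7b",         # assumed Hugging Face repo ID for the LLM
    local_dir="./checkpoints/internlm2-7b",  # hypothetical local path
)
vit_path = snapshot_download(
    repo_id="QuanSun/EVA-CLIP",              # assumed repo hosting EVA-CLIP weights;
    local_dir="./checkpoints/eva-clip",      # pick the file matching eva-clip-l-14-336
)
print(llm_path, vit_path)
```

Wherever the weights land, the model paths referenced by the eval configs need to match.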
You can place the benchmark data in the benchmarks directory. Data structure:
├── benchmarks
│   ├── MMEBenmark
│   │   ├── images
│   │   └── Data_json
Each JSON file in Data_json contains the image name, question, and answer, e.g.:
10002.jpg Does this artwork exist in the form of painting? Yes
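For illustration, such a record could be written out as below; the JSON field names and the artwork.json filename are hypothetical, so match them to the keys that eval/eval_mme_finetuning.py actually expects:

```python
# Hypothetical Data_json record builder. Field names and the output filename
# are illustrative assumptions, not the repo's actual schema.
import json
import os

record = {
    "image_name": "10002.jpg",
    "question": "Does this artwork exist in the form of painting?",
    "answer": "Yes",
}
os.makedirs("benchmarks/MMEBenmark/Data_json", exist_ok=True)
with open("benchmarks/MMEBenmark/Data_json/artwork.json", "w") as f:
    json.dump([record], f, indent=2)
```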
Step 1: Generate the responses:
python3 eval/eval_mme_finetuning.py --config ./configs/MMEBench_interLM2-7B.json
Step 2: Calculate the score:
python3 eval/calculation_mme.py --results_dir ./output/internlm2-MME
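For orientation: MME's published convention scores each subtask as acc + acc+ on a 0-200 scale, where acc is question-level accuracy and acc+ counts an image as correct only if both of its two questions are answered correctly. The sketch below paraphrases that metric from its definition; it is not the repo's eval/calculation_mme.py:

```python
# Minimal sketch of the MME scoring convention (acc + acc+), paraphrased from
# the metric's definition; not the repo's eval/calculation_mme.py.
from collections import defaultdict

def mme_score(results):
    """results: list of (image_name, prediction, answer); two questions per image."""
    per_image = defaultdict(list)
    for image_name, pred, ans in results:
        # Exact match for simplicity; real scripts may normalize answers further.
        per_image[image_name].append(pred.strip().lower() == ans.strip().lower())

    flags = [ok for oks in per_image.values() for ok in oks]
    acc = sum(flags) / len(flags)                # question-level accuracy
    acc_plus = sum(all(oks) for oks in per_image.values()) / len(per_image)
    return 100 * (acc + acc_plus)                # subtask score on a 0-200 scale

print(mme_score([("10002.jpg", "Yes", "Yes"),
                 ("10002.jpg", "No", "Yes")]))   # 50.0
```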
For other benchmarks, please follow their official instructions to construct the files; the overall pipeline is the same as for the MME benchmark.
This project is built on MiniGPT and BLIP2. Sincere thanks to the contributors of these excellent codebases.
If you find our code helpful for your research, please consider citing us with the following BibTeX entry:
@misc{wang2024pargobridgingvisionlanguagepartial,
  title={ParGo: Bridging Vision-Language with Partial and Global Views},
  author={An-Lan Wang and Bin Shan and Wei Shi and Kun-Yu Lin and Xiang Fei and Guozhi Tang and Lei Liao and Jingqun Tang and Can Huang and Wei-Shi Zheng},
  year={2024},
  eprint={2408.12928},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2408.12928},
}
The source code and pretrained weights are licensed under the BSD-3-Clause license.