A toolbox for benchmarking trustworthiness of multimodal large language models (MultiTrust, NeurIPS 2024 Track Datasets and Benchmarks)

thu-ml/MMTrustEval

background
🌐 Project Page | 📖 arXiv Paper | 📜 Documentation | 📊 Dataset | 🤗 Hugging Face | 🏆 Leaderboard

Truthfulness Safety Robustness Fairness Privacy


MultiTrust is a comprehensive benchmark designed to assess and enhance the trustworthiness of MLLMs across five key dimensions: truthfulness, safety, robustness, fairness, and privacy. It integrates a rigorous evaluation strategy involving 32 diverse tasks to expose new trustworthiness challenges.

framework

🚀 News

🛠️ Installation

The environment for this version has been updated to accommodate more recent models. To replicate the experimental results presented in the paper more precisely, switch to the branch v0.1.0.

  • Option A: Pip install

    conda create -n multitrust python=3.9
    conda activate multitrust
    
    # Note: Tsinghua Source can be discarded.
    pip install -r env/requirements.txt
  • Option B: Docker

    • (Optional) Commands to install Docker
    # Our docker version:
    #     Client: Docker Engine - Community
    #     Version:           27.0.0-rc.1
    #     API version:       1.46
    #     Go version:        go1.21.11
    #     OS/Arch:           linux/amd64
    
    distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
    curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
    curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
    
    sudo apt-get update
    sudo apt-get install -y nvidia-container-toolkit
    
    sudo systemctl restart docker
    sudo usermod -aG docker [your_username_here]
    • Commands to install environment
    #  Note: 
    # [code] is an `absolute path` of project root: abspath(./)
    # [data] and [playground] are `absolute paths` of data and model_playground(decompress our provided data/playground).
    
    docker build -t multitrust:v0.0.1 -f env/Dockerfile .
    
    docker run -it \
        --name multitrust \
        --gpus all \
        --privileged=true \
        --shm-size=10gb \
        -v /home/[your_user_name_here]/.cache/huggingface:/root/.cache/huggingface \
        -v /home/[your_user_name_here]/.cache/torch:/root/.cache/torch \
        -v [code]:/root/multitrust \
        -v [data]:/root/multitrust/data \
        -v [playground]:/root/multitrust/playground \
        -w /root/multitrust \
        -p 11180:22 \
        -p 8000:8000 \
        -d multitrust:v0.0.1 /bin/bash
    
    # entering the container by docker exec
    docker exec -it multitrust /bin/bash
    
    # or entering by ssh
    ssh -p 11180 root@[your_ip_here]
  • Several tasks require commercial APIs for auxiliary testing. To run all tasks, add the corresponding model API keys to env/apikey.yml.

✉️ Dataset

License

  • The codebase is licensed under the CC BY-SA 4.0 license.

  • MultiTrust is intended for academic research only. Commercial use in any form is prohibited.

  • If there is any infringement in MultiTrust, please directly raise an issue, and we will remove it immediately.

Data Preparation

Refer here for detailed instructions.

📚 Docs

Our documentation presents the interface definitions of the different modules, along with tutorials on how to extend them. It is available online at: https://thu-ml.github.io/MMTrustEval/

Run the following command to view the docs locally.

mkdocs serve -f env/mkdocs.yml -a 0.0.0.0:8000

📈 Reproduce Results in Our Paper

Running scripts under scripts/run can generate the model outputs of specific tasks and corresponding primary evaluation results in either a global or sample-wise manner.

📌 To Run Inference

# Description: Run scripts require a model_id to run inference tasks.
# Usage: bash scripts/run/*/*.sh <model_id>

scripts/run
├── fairness_scripts
│   ├── f1-stereo-generation.sh
│   ├── f2-stereo-agreement.sh
│   ├── f3-stereo-classification.sh
│   ├── f3-stereo-topic-classification.sh
│   ├── f4-stereo-query.sh
│   ├── f5-vision-preference.sh
│   ├── f6-profession-pred.sh
│   └── f7-subjective-preference.sh
├── privacy_scripts
│   ├── p1-vispriv-recognition.sh
│   ├── p2-vqa-recognition-vispr.sh
│   ├── p3-infoflow.sh
│   ├── p4-pii-query.sh
│   ├── p5-visual-leakage.sh
│   └── p6-pii-leakage-in-conversation.sh
├── robustness_scripts
│   ├── r1-ood-artistic.sh
│   ├── r2-ood-sensor.sh
│   ├── r3-ood-text.sh
│   ├── r4-adversarial-untarget.sh
│   ├── r5-adversarial-target.sh
│   └── r6-adversarial-text.sh
├── safety_scripts
│   ├── s1-nsfw-image-description.sh
│   ├── s2-risk-identification.sh
│   ├── s3-toxic-content-generation.sh
│   ├── s4-typographic-jailbreaking.sh
│   ├── s5-multimodal-jailbreaking.sh
│   └── s6-crossmodal-jailbreaking.sh
└── truthfulness_scripts
    ├── t1-basic.sh
    ├── t2-advanced.sh
    ├── t3-instruction-enhancement.sh
    ├── t4-visual-assistance.sh
    ├── t5-text-misleading.sh
    ├── t6-visual-confusion.sh
    └── t7-visual-misleading.sh
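
The usage pattern above, `bash scripts/run/*/*.sh <model_id>`, can be expanded into one command per script. The following is a minimal sketch rather than part of the toolbox: the helper name `build_inference_commands` and the placeholder `my-model-id` are assumptions made for illustration, and the actual launch is left commented out.

```python
from pathlib import Path
from typing import List


def build_inference_commands(model_id: str, run_root: str = "scripts/run") -> List[List[str]]:
    """Return one `bash <script> <model_id>` command per script under run_root.

    Hypothetical helper: it only mirrors the documented usage
    `bash scripts/run/*/*.sh <model_id>`.
    """
    scripts = sorted(Path(run_root).glob("*/*.sh"))
    return [["bash", str(script), model_id] for script in scripts]


if __name__ == "__main__":
    # "my-model-id" is a placeholder, not a real model identifier.
    for cmd in build_inference_commands("my-model-id"):
        print(" ".join(cmd))
        # To actually launch a script, import subprocess and call:
        # subprocess.run(cmd, check=True)
```

Printing the commands first makes it easy to check which tasks would run before committing GPU time to them.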

📌 To Evaluate Results

After that, scripts under scripts/score can be used to calculate the statistical results based on the outputs and show the results reported in the paper.

# Description: Run scripts require a model_id to calculate statistical results.
# Usage: python scripts/score/*/*.py --model_id <model_id>

scripts/score
├── fairness
│   ├── f1-stereo-generation.py
│   ├── f2-stereo-agreement.py
│   ├── f3-stereo-classification.py
│   ├── f3-stereo-topic-classification.py
│   ├── f4-stereo-query.py
│   ├── f5-vision-preference.py
│   ├── f6-profession-pred.py
│   └── f7-subjective-preference.py
├── privacy
│   ├── p1-vispriv-recognition.py
│   ├── p2-vqa-recognition-vispr.py
│   ├── p3-infoflow.py
│   ├── p4-pii-query.py
│   ├── p5-visual-leakage.py
│   └── p6-pii-leakage-in-conversation.py
├── robustness
│   ├── r1-ood_artistic.py
│   ├── r2-ood_sensor.py
│   ├── r3-ood_text.py
│   ├── r4-adversarial_untarget.py
│   ├── r5-adversarial_target.py
│   └── r6-adversarial_text.py
├── safefy
│   ├── s1-nsfw-image-description.py
│   ├── s2-risk-identification.py
│   ├── s3-toxic-content-generation.py
│   ├── s4-typographic-jailbreaking.py
│   ├── s5-multimodal-jailbreaking.py
│   └── s6-crossmodal-jailbreaking.py
└── truthfulness
    ├── t1-basic.py
    ├── t2-advanced.py
    ├── t3-instruction-enhancement.py
    ├── t4-visual-assistance.py
    ├── t5-text-misleading.py
    ├── t6-visual-confusion.py
    └── t7-visual-misleading.py
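
Score script names do not always match the run script names exactly: in the listings, the robustness score scripts use underscores (`r1-ood_artistic.py`) and two scripts share the `f3` prefix. A rough sketch of grouping score scripts by task ID, taken as the part of the file name before the first hyphen, might look like the following. The helper names `task_id`, `group_by_task`, and `scoring_commands` are hypothetical and not part of the toolbox; only the command shape `python <script> --model_id <model_id>` comes from the usage line above.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def task_id(path: Path) -> str:
    """Return the part of the file name before the first hyphen, e.g. 'r1'."""
    return path.name.split("-", 1)[0]


def group_by_task(root: str, pattern: str) -> Dict[str, List[Path]]:
    """Group files matching `pattern` under `root` by their task ID."""
    groups: Dict[str, List[Path]] = defaultdict(list)
    for path in sorted(Path(root).glob(pattern)):
        groups[task_id(path)].append(path)
    return dict(groups)


def scoring_commands(model_id: str, score_root: str = "scripts/score") -> Dict[str, List[List[str]]]:
    """Map each task ID to its `python <script> --model_id <model_id>` commands."""
    return {
        tid: [["python", str(p), "--model_id", model_id] for p in paths]
        for tid, paths in group_by_task(score_root, "*/*.py").items()
    }
```

Grouping by prefix keeps both `f3` scripts under a single task key, so a task with more than one score script is scored in full.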

📌 Task List

All 32 tasks are listed below. Evaluation types: ○: rule-based evaluation (e.g., keyword matching); ●: automatic evaluation by GPT-4 or other classifiers; ◐: mixed evaluation.

| ID | Task Name | Metrics | Task Type | Eval |
|----|-----------|---------|-----------|------|
| T.1 | Basic World Understanding | Accuracy ($\uparrow$) | Dis.&Gen. | ◐ |
| T.2 | Advanced Cognitive Inference | Accuracy ($\uparrow$) | Dis. | ○ |
| T.3 | VQA under Instruction Enhancement | Accuracy ($\uparrow$) | Gen. | ● |
| T.4 | QA under Visual Assistance | Accuracy ($\uparrow$) | Gen. | ● |
| T.5 | Text Misleading VQA | Accuracy ($\uparrow$) | Gen. | ● |
| T.6 | Visual Confusion VQA | Accuracy ($\uparrow$) | Gen. | ○ |
| T.7 | Visual Misleading QA | Accuracy ($\uparrow$) | Dis. | ● |
| S.1 | Risk Identification | Accuracy ($\uparrow$) | Dis.&Gen. | ◐ |
| S.2 | Image Description | Toxicity Score ($\downarrow$), RtA ($\uparrow$) | Gen. | ● |
| S.3 | Toxicity Content Generation | Toxicity Score ($\downarrow$), RtA ($\uparrow$) | Gen. | ◐ |
| S.4 | Plain Typographic Jailbreaking | ASR ($\downarrow$), RtA ($\uparrow$) | Gen. | ◐ |
| S.5 | Optimized Multimodal Jailbreaking | ASR ($\downarrow$), RtA ($\uparrow$) | Gen. | ◐ |
| S.6 | Cross-modal Influence on Jailbreaking | ASR ($\downarrow$), RtA ($\uparrow$) | Gen. | ◐ |
| R.1 | VQA for Artistic Style images | Score ($\uparrow$) | Gen. | ◐ |
| R.2 | VQA for Sensor Style images | Score ($\uparrow$) | Gen. | ● |
| R.3 | Sentiment Analysis for OOD texts | Accuracy ($\uparrow$) | Dis. | ○ |
| R.4 | Image Captioning under Untarget attack | Accuracy ($\uparrow$) | Gen. | ◐ |
| R.5 | Image Captioning under Target attack | Attack Success Rate ($\downarrow$) | Gen. | ◐ |
| R.6 | Textual Adversarial Attack | Accuracy ($\uparrow$) | Dis. | ○ |
| F.1 | Stereotype Content Detection | Containing Rate ($\downarrow$) | Gen. | ● |
| F.2 | Agreement on Stereotypes | Agreement Percentage ($\downarrow$) | Dis. | ◐ |
| F.3 | Classification of Stereotypes | Accuracy ($\uparrow$) | Dis. | ○ |
| F.4 | Stereotype Query Test | RtA ($\uparrow$) | Gen. | ◐ |
| F.5 | Preference Selection in VQA | RtA ($\uparrow$) | Gen. | ● |
| F.6 | Profession Prediction | Pearson's Correlation ($\uparrow$) | Gen. | ◐ |
| F.7 | Preference Selection in QA | RtA ($\uparrow$) | Gen. | ● |
| P.1 | Visual Privacy Recognition | Accuracy, F1 ($\uparrow$) | Dis. | ○ |
| P.2 | Privacy-sensitive QA Recognition | Accuracy, F1 ($\uparrow$) | Dis. | ○ |
| P.3 | InfoFlow Expectation | Pearson's Correlation ($\uparrow$) | Gen. | ○ |
| P.4 | PII Query with Visual Cues | RtA ($\uparrow$) | Gen. | ◐ |
| P.5 | Privacy Leakage in Vision | RtA ($\uparrow$), Accuracy ($\uparrow$) | Gen. | ◐ |
| P.6 | PII Leakage in Conversations | RtA ($\uparrow$) | Gen. | ◐ |
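
To make the rule-based evaluation by keyword matching concrete, here is a small sketch of how an RtA-style metric could be computed. It reads RtA as a refuse-to-answer rate, which is an assumption because the table does not expand the abbreviation. The keyword list and the function names are purely illustrative and are not the ones MultiTrust uses; the real evaluators live in the toolbox.

```python
from typing import Iterable, Sequence

# Hypothetical refusal phrases for illustration only; NOT MultiTrust's actual list.
REFUSAL_KEYWORDS = ("i'm sorry", "i cannot", "i can't", "unable to")


def is_refusal(response: str, keywords: Sequence[str] = REFUSAL_KEYWORDS) -> bool:
    """Return True if the lower-cased response contains any refusal keyword."""
    text = response.lower()
    return any(keyword in text for keyword in keywords)


def refuse_to_answer_rate(responses: Iterable[str]) -> float:
    """Fraction of responses flagged as refusals (0.0 for an empty input)."""
    items = list(responses)
    if not items:
        return 0.0
    return sum(is_refusal(r) for r in items) / len(items)
```

Keyword matching is cheap and deterministic but can miss refusals phrased in unexpected ways, which may be one reason many tasks in the table rely on mixed (◐) or classifier-based (●) evaluation instead.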

⚛️ Overall Results

  • Proprietary models such as GPT-4V and Claude3 consistently achieve top performance, owing to enhancements in alignment and safety filters compared with open-source models.
  • A global analysis reveals a correlation coefficient of 0.60 between the general capabilities and the trustworthiness of MLLMs, indicating that stronger general abilities can improve trustworthiness to some extent.
  • Finer correlation analysis shows no significant link across different aspects of trustworthiness, highlighting the need for comprehensive aspect division and identifying gaps in achieving trustworthiness.
result
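
For readers who want to run a similar correlation analysis on their own scores, a Pearson coefficient can be computed with the standard library alone. This is a generic sketch of the textbook formula, not the authors' analysis code. It assumes Pearson's coefficient is the one meant (the source does not say which correlation coefficient was used), and any input lists passed to it are the caller's own data.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

A call such as `pearson(general_scores, trust_scores)`, with one entry per model in each list, yields a value between -1 and 1.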

✒️ Citation

If you find our work helpful for your research, please consider citing it.

@misc{zhang2024benchmarking,
      title={Benchmarking Trustworthiness of Multimodal Large Language Models: A Comprehensive Study}, 
      author={Yichi Zhang and Yao Huang and Yitong Sun and Chang Liu and Zhe Zhao and Zhengwei Fang and
              Yifan Wang and Huanran Chen and Xiao Yang and Xingxing Wei and Hang Su and Yinpeng Dong and
              Jun Zhu},
      year={2024},
      eprint={2406.07057},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
    }