- Updates
- Introduction
- Key Features
- Getting Started
- Data Statistics
- Submission Guideline
- Evaluation Method ToolKit
- Contributing
- Results
- Citation and Acknowledgement
- [13/12/2024] Code and dataset released: We released the training part of the EvalMuse dataset (30K) and the code of FGA-BLIP2, which achieves SOTA performance in T2I model alignment evaluation.
- [25/12/2024]
EvalMuse-40K is a reliable and fine-grained benchmark designed to evaluate the performance of Text-to-Image (T2I) generation models. It comprises 40,000 image-text pairs with comprehensive human annotations for image-text alignment-related tasks. Based on this dataset, we propose two methods to evaluate T2I alignment automatically: FGA-BLIP2 and PN-VQA.
- Large-scale T2I Evaluation Dataset: Includes 40,000 image-text pairs with over 1 million fine-grained human annotations.
- Diversity and Reliability: Employs strategies like balanced prompt sampling and data re-annotation to ensure diversity and reliability.
- Fine-grained Evaluation: Categorizes elements during fine-grained annotation, allowing evaluation of specific skills at a granular level.
- New Evaluation Methods: Introduces FGA-BLIP2 and PN-VQA methods for end-to-end fine-tuning and zero-shot fine-grained evaluation.
- Leaderboard: We maintain a ranking of T2I models that is updated weekly, showing the cutting-edge progress of T2I models.
To use the EvalMuse-40K dataset and replicate the experiments, follow these steps:

- Clone the Repository:

```bash
git clone https://github.com/DYEvaLab/EvalMuse
cd EvalMuse
```

- Install Dependencies:

```bash
pip install -r requirements.txt
```

- Download the Dataset and Preprocess the Data (a conceptual sketch of the preprocessing step follows these steps):

```bash
# Download the dataset from Huggingface
sh scripts/download.sh
# Average the annotation scores and compute the variance of the alignment scores
# for the image-text pairs corresponding to the same prompt
python3 process/process_train.py
# Map each element split from the prompt to its specific index in the prompt
python3 process/element2mask.py
```

- Run the Training Script:

```bash
sh scripts/train.sh
```

- Evaluate the Models: You can download the pre-trained FGA-BLIP2 model weights from [Huggingface] or [Baidu Cloud].

```bash
sh scripts/eval.sh
```
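For reference, the sketch below illustrates what the preprocessing step conceptually does: averaging per-annotator alignment scores for each image-text pair and then computing the score variance across images generated from the same prompt. The column names here are hypothetical placeholders; `process/process_train.py` is the authoritative implementation.

```python
import pandas as pd

# Hypothetical flat annotation table; the real field names live in the dataset
# files and are handled by process/process_train.py.
df = pd.DataFrame({
    "prompt":          ["a red cube on a table"] * 4,
    "image_id":        ["img_0", "img_0", "img_1", "img_1"],
    "alignment_score": [4, 5, 2, 3],  # per-annotator scores
})

# Average the annotation scores for each image-text pair.
pair_scores = (
    df.groupby(["prompt", "image_id"], as_index=False)["alignment_score"]
      .mean()
      .rename(columns={"alignment_score": "mean_score"})
)

# Variance of the averaged alignment scores across images from the same prompt.
prompt_variance = (
    pair_scores.groupby("prompt")["mean_score"]
               .var(ddof=0)
               .rename("score_variance")
)

print(pair_scores)
print(prompt_variance)
```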
- Alignment Score Distribution: The alignment scores are widely distributed, providing rich samples for evaluating how consistently existing image-text alignment metrics agree with human preferences.
- Differences in Human Preferences: 75% of alignment scores differ by less than 1, showing high annotation consistency. Pairs with larger differences were re-annotated to reduce bias.
- Fine-Grained Annotation Quantity and Scores: Most categories have alignment scores around 50%, ensuring balanced positive and negative samples. We also found that AIGC models show weaker consistency in counting, spatial relationships, and activities.
We recommend viewing the detailed ranking results and fine-grained analysis on our leaderboard website.
Quantitative comparison between our FGA-BLIP2 and other state-of-the-art methods that only use the image-text pair to output an overall alignment score, on multiple benchmarks. Here, `var` refers to the variance optimization strategy, `os` represents the overall alignment score output by FGA-BLIP2, and `es_avg` is the average of the element scores output by FGA-BLIP2.
Method | EvalMuse-40K (SRCC) | EvalMuse-40K (PLCC) | GenAI-Bench (SRCC) | GenAI-Bench (PLCC) | TIFA (SRCC) | TIFA (PLCC) | RichHF (SRCC) | RichHF (PLCC) |
---|---|---|---|---|---|---|---|---|
CLIPScore | 0.2993 | 0.2933 | 0.1676 | 0.203 | 0.3003 | 0.3086 | 0.057 | 0.3024 |
BLIPv2Score | 0.3583 | 0.3348 | 0.2734 | 0.2979 | 0.4287 | 0.4543 | 0.1425 | 0.3105 |
ImageReward | 0.4655 | 0.4585 | 0.34 | 0.3786 | 0.6211 | 0.6336 | 0.2747 | 0.3291 |
PickScore | 0.4399 | 0.4328 | 0.3541 | 0.3631 | 0.4279 | 0.4342 | 0.3916 | 0.4133 |
HPSv2 | 0.3745 | 0.3657 | 0.1371 | 0.1693 | 0.3647 | 0.3804 | 0.1871 | 0.2577 |
VQAScore | 0.4877 | 0.4841 | 0.5534 | 0.5175 | 0.6951 | 0.6585 | 0.4826 | 0.4094 |
FGA-BLIP2 (w/o var, os) | 0.7708 | 0.7698 | 0.5548 | 0.5589 | 0.7548 | 0.741 | 0.5073 | 0.5384 |
FGA-BLIP2 (es_avg) | 0.6809 | 0.6867 | 0.5206 | 0.5259 | 0.7419 | 0.736 | 0.3413 | 0.3096 |
FGA-BLIP2 (os) | 0.7742 | 0.7722 | 0.5637 | 0.5673 | 0.7604 | 0.7442 | 0.5123 | 0.5455 |
FGA-BLIP2 (os+es_avg) | 0.7723 | 0.7716 | 0.5638 | 0.5684 | 0.7657 | 0.7508 | 0.4576 | 0.4967 |
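The SRCC and PLCC values reported above measure rank and linear correlation between predicted alignment scores and human-annotated scores. A minimal sketch of how such correlations can be computed with SciPy (the score arrays below are made-up illustrations, not benchmark data):

```python
import numpy as np
from scipy import stats

# Hypothetical predicted overall alignment scores and human-annotated scores
# for the same image-text pairs (values are illustrative only).
predicted = np.array([0.82, 0.41, 0.67, 0.90, 0.33, 0.58])
human     = np.array([4.5,  2.0,  3.5,  5.0,  1.5,  3.0])

srcc, _ = stats.spearmanr(predicted, human)  # rank correlation
plcc, _ = stats.pearsonr(predicted, human)   # linear correlation

print(f"SRCC: {srcc:.4f}, PLCC: {plcc:.4f}")
```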
Quantitative comparison between our methods and the state-of-the-art methods for fine-grained evaluation on EvalMuse-40K. Here, we report each method's correlation on overall alignment scores and its accuracy on fine-grained alignment. Element-GT refers to the manually annotated fine-grained scores. `es` represents the element alignment score output by FGA-BLIP2.
Method | MLLMs | Overall SRCC | Overall Acc (%) | Real SRCC | Real Acc (%) | Synth SRCC | Synth Acc (%) |
---|---|---|---|---|---|---|---|
TIFA | LLaVA1.6 | 0.2937 | 62.1 | 0.2348 | 62.6 | 0.4099 | 60.6 |
TIFA | mPLUG-Owl3 | 0.4303 | 64.5 | 0.3890 | 64.5 | 0.5197 | 64.4 |
TIFA | Qwen2-VL | 0.4145 | 64.5 | 0.3701 | 64.4 | 0.5049 | 64.7 |
VQ2* | LLaVA1.6 | 0.4749 | 67.5 | 0.4499 | 67.2 | 0.5314 | 68.4 |
VQ2* | mPLUG-Owl3 | 0.5004 | 66.4 | 0.4458 | 65.8 | 0.6145 | 68.0 |
VQ2* | Qwen2-VL | 0.5415 | 67.9 | 0.4893 | 67.3 | 0.6653 | 67.0 |
PN-VQA* (ours) | LLaVA1.6 | 0.4765 | 66.1 | 0.4347 | 65.5 | 0.5486 | 67.7 |
PN-VQA* (ours) | mPLUG-Owl3 | 0.5246 | 67.6 | 0.5044 | 67.1 | 0.6032 | 69.0 |
PN-VQA* (ours) | Qwen2-VL | 0.5748 | 68.2 | 0.5315 | 67.0 | 0.6946 | 71.9 |
FGA-BLIP2 (es, ours) | BLIP2 | 0.6800 | 76.8 | 0.6298 | 75.9 | 0.7690 | 79.6 |
Element-GT | - | 0.7273 | - | 0.6891 | - | 0.7839 | - |
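Fine-grained accuracy in this table reflects agreement between element-level predictions and the human element annotations. The sketch below assumes binary element labels and a 0.5 decision threshold purely for illustration; the exact protocol follows the paper and the evaluation code.

```python
import numpy as np

# Hypothetical element-level predicted scores in [0, 1] and human labels
# (1 = element aligned, 0 = not aligned); values are illustrative only.
element_scores = np.array([0.91, 0.12, 0.55, 0.40, 0.78])
human_labels   = np.array([1,    0,    1,    1,    1])

# Assumed 0.5 threshold turns scores into binary predictions.
predictions = (element_scores >= 0.5).astype(int)
accuracy = (predictions == human_labels).mean()
print(f"fine-grained accuracy: {accuracy:.1%}")
```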
The table reports the overall image-text alignment scores and fine-grained alignment scores for various skills, evaluated using FGA-BLIP2. Here, a./h. is an abbreviation for animal/human.
Model | Overall Score | Attribute | Location | Color | Object | Material | A./H. | Food | Shape | Activity | Spatial | Counting |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Dreamina v2.0Pro | 3.74 | 0.821 | 0.793 | 0.706 | 0.747 | 0.689 | 0.756 | 0.700 | 0.580 | 0.662 | 0.747 | 0.477 |
DALLE 3 | 3.63 | 0.814 | 0.782 | 0.692 | 0.640 | 0.701 | 0.734 | 0.700 | 0.682 | 0.644 | 0.768 | 0.438 |
FLUX 1.1 | 3.47 | 0.819 | 0.758 | 0.660 | 0.642 | 0.638 | 0.686 | 0.673 | 0.607 | 0.596 | 0.671 | 0.362 |
Midjourney v6.1 | 3.33 | 0.807 | 0.736 | 0.637 | 0.693 | 0.625 | 0.619 | 0.718 | 0.659 | 0.599 | 0.716 | 0.285 |
SD 3 | 3.27 | 0.790 | 0.728 | 0.595 | 0.695 | 0.546 | 0.560 | 0.716 | 0.637 | 0.559 | 0.646 | 0.305 |
Playground v2.5 | 3.20 | 0.812 | 0.785 | 0.544 | 0.657 | 0.541 | 0.578 | 0.709 | 0.675 | 0.574 | 0.634 | 0.262 |
SDXL-Turbo | 3.15 | 0.788 | 0.714 | 0.494 | 0.659 | 0.487 | 0.567 | 0.671 | 0.665 | 0.551 | 0.644 | 0.306 |
HunyuanDiT | 3.08 | 0.794 | 0.753 | 0.555 | 0.666 | 0.524 | 0.576 | 0.682 | 0.705 | 0.586 | 0.648 | 0.247 |
Kandinsky3 | 3.08 | 0.793 | 0.723 | 0.541 | 0.652 | 0.513 | 0.583 | 0.681 | 0.661 | 0.564 | 0.665 | 0.291 |
SDXL | 2.99 | 0.786 | 0.717 | 0.467 | 0.623 | 0.463 | 0.533 | 0.677 | 0.660 | 0.531 | 0.607 | 0.276 |
PixArt-Σ | 2.98 | 0.792 | 0.755 | 0.564 | 0.633 | 0.533 | 0.561 | 0.692 | 0.703 | 0.533 | 0.641 | 0.238 |
Kolors | 2.93 | 0.790 | 0.722 | 0.498 | 0.622 | 0.480 | 0.527 | 0.621 | 0.713 | 0.496 | 0.594 | 0.245 |
SDXL-Lightning | 2.93 | 0.788 | 0.729 | 0.478 | 0.619 | 0.458 | 0.534 | 0.619 | 0.600 | 0.528 | 0.609 | 0.274 |
SSD1B | 2.93 | 0.798 | 0.730 | 0.502 | 0.610 | 0.480 | 0.504 | 0.688 | 0.684 | 0.508 | 0.590 | 0.297 |
PixArt-α | 2.88 | 0.780 | 0.738 | 0.483 | 0.607 | 0.472 | 0.521 | 0.627 | 0.670 | 0.523 | 0.600 | 0.240 |
IF | 2.77 | 0.725 | 0.620 | 0.452 | 0.577 | 0.416 | 0.475 | 0.570 | 0.632 | 0.498 | 0.581 | 0.188 |
LCM-SDXL | 2.77 | 0.762 | 0.706 | 0.465 | 0.575 | 0.454 | 0.513 | 0.616 | 0.615 | 0.496 | 0.587 | 0.273 |
PixArt-δ | 2.73 | 0.768 | 0.718 | 0.455 | 0.565 | 0.432 | 0.486 | 0.634 | 0.685 | 0.496 | 0.574 | 0.207 |
LCM-SSD1B | 2.66 | 0.761 | 0.683 | 0.451 | 0.540 | 0.393 | 0.457 | 0.523 | 0.673 | 0.459 | 0.572 | 0.265 |
SD v2.1 | 2.42 | 0.698 | 0.590 | 0.354 | 0.502 | 0.363 | 0.431 | 0.532 | 0.559 | 0.398 | 0.528 | 0.190 |
SD v1.5 | 2.25 | 0.671 | 0.534 | 0.328 | 0.470 | 0.337 | 0.372 | 0.487 | 0.500 | 0.352 | 0.488 | 0.180 |
SD v1.2 | 2.25 | 0.659 | 0.515 | 0.315 | 0.471 | 0.377 | 0.393 | 0.498 | 0.547 | 0.349 | 0.493 | 0.181 |
Our EvalMuse-40K can be used for the following three tasks:
- evaluating the correlation of the overall image-text alignment scores with human preferences,
- evaluating the correlation of the fine-grained image-text alignment scores with human preferences,
- evaluating the performance of the T2I model on the image-text alignment task.
For evaluating model correlation with human preferences, you can download our dataset from [Huggingface]. You can train with our training set (with human-annotated scores) and output your model's results on the test set. Since we do not provide human-annotated scores for the test set right now (they will be available later), you can email [email protected] to submit your results in JSON format and receive the correlation with human preferences.
For evaluating the image-text alignment performance of the T2I model, we recommend using FGA-BLIP2, which achieves good performance in both overall alignment and fine-grained alignment.
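Once per-pair overall scores from FGA-BLIP2 are available (e.g., after running `scripts/eval.sh`), ranking T2I models reduces to averaging their scores. A minimal sketch with hypothetical column names (not the script's actual output schema):

```python
import pandas as pd

# Hypothetical per-pair results exported after evaluation; the column names
# are illustrative placeholders, not the script's real output fields.
results = pd.DataFrame({
    "model":         ["SDXL", "SDXL", "SD 3", "SD 3"],
    "overall_score": [3.1, 2.9, 3.4, 3.2],
})

# Rank models by their mean overall alignment score.
ranking = (
    results.groupby("model")["overall_score"]
           .mean()
           .sort_values(ascending=False)
)
print(ranking)
```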
TBD
We welcome contributions to EvalMuse-40K. If you have ideas or bug reports, please open an issue or submit a pull request.
If you find EvalMuse-40K useful for your research, please consider citing our paper:
@misc{han2024evalmuse40kreliablefinegrainedbenchmark,
title={EvalMuse-40K: A Reliable and Fine-Grained Benchmark with Comprehensive Human Annotations for Text-to-Image Generation Model Evaluation},
author={Shuhao Han and Haotian Fan and Jiachen Fu and Liang Li and Tao Li and Junhui Cui and Yunqiu Wang and Yang Tai and Jingwei Sun and Chunle Guo and Chongyi Li},
year={2024},
eprint={2412.18150},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2412.18150},
}