
Boosting Text-To-Image Generation via Multilingual Prompting in Large Multimodal Models

This repository contains the datasets, code, and results for our ICASSP 2025 paper. We aim to enhance the comprehension of input text within the in-context learning paradigm for large multimodal models.

Figure: overview of the PMT2I method.

Our method, PMT2I, involves translating text prompts into multiple languages, constructing parallel multilingual prompts, performing inference with large multimodal models, and reranking the image candidates.
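
As a rough illustration of the prompt-construction step, the sketch below concatenates the original English description with its translations into a single parallel multilingual prompt. The template and language labels here are assumptions for illustration; see Step2_Construct_PMT2 for the exact format used in the paper.

# Minimal sketch of parallel multilingual prompt construction.
# The template and language order are assumptions, not the repo's exact format.
from typing import Dict

def build_pmt2i_prompt(english_caption: str, translations: Dict[str, str]) -> str:
    """Combine the original caption and its translations into one T2I prompt."""
    lines = ["English: " + english_caption]
    for language, text in translations.items():
        lines.append(f"{language}: {text}")
    return "\n".join(lines)

if __name__ == "__main__":
    caption = "A red bicycle leaning against a brick wall."
    translations = {
        "German": "Ein rotes Fahrrad lehnt an einer Backsteinmauer.",
        "French": "Un vélo rouge appuyé contre un mur de briques.",
    }
    print(build_pmt2i_prompt(caption, translations))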


Here is our abstract. Previous work on augmenting large multimodal models (LMMs) for text-to-image (T2I) generation has focused on enriching the input space of in-context learning (ICL). This includes providing a few demonstrations and optimizing image descriptions to be more detailed and logical. However, as demand for more complex and flexible image descriptions grows, enhancing comprehension of input text within the ICL paradigm remains a critical yet underexplored area. In this work, we extend this line of research by constructing parallel multilingual prompts aimed at harnessing the multilingual capabilities of LMMs. More specifically, we translate the input text into several languages and provide the models with both the original text and the translations. Experiments on two LMMs across 3 benchmarks show that our method, PMT2I, achieves superior performance in general, compositional, and fine-grained assessments, especially in human preference alignment. Additionally, with its advantage of generating more diverse images, PMT2I significantly outperforms baseline prompts when incorporated with reranking methods.

Overview

We release the code, data, and some results of our work in this repository, which is structured as follows:

Datasets

Contains the image descriptions of the three benchmarks used in our experiments, namely CompBench, DrawBench, and MS_COCO. Note that only MS_COCO includes ground-truth images, while the other datasets contain only text inputs. You can download the full MS_COCO (text and ground-truth images) by clicking here.

Our Results

Stores the outcomes of our experiments, including the prompts and translations for each dataset. The image descriptions of these benchmarks are in English, and we translated them into six languages. For the MS-COCO dataset, we used GPT-4o to translate into German and Spanish, DeepL for French and Italian, and NiuTrans for Russian and Chinese. For the other benchmarks, GPT-4o was used for all six languages. It is worth noting that these translations also serve as parallel multilingual T2I benchmarks and can be used directly :)

Steps to Reproduce Results

Each step in the repository has its own Run.sh script that details how to execute the programs. You can start each step independently, but it is recommended to follow the sequential order:

  • Step1_Translate: Scripts for translating the English image descriptions into multiple languages.
  • Step2_Construct_PMT2: Scripts for constructing prompts using the translated data.
  • Step3_LMM_Inference: Inference scripts for generating images from the constructed prompts, including Lumina and Emu2.
  • Step4_Evaluate: Evaluation scripts for assessing the quality of generated images, including BliScore, Clip_Dino, GPT4o-Judgement, and ImgReward.
  • Step5_Rerank: Scripts for reranking images based on model evaluations (a rough sketch of the idea follows this list).
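
As a minimal sketch of the reranking idea behind Step5_Rerank, the snippet below simply keeps the highest-scoring candidate image per prompt. The scoring function (e.g. an ImageReward- or CLIP-style scorer) and the data layout are placeholders, not the repository's exact interface.

# Hypothetical reranking sketch: keep the candidate image with the highest score.
# The scoring function and data structures are placeholders, not the repo's API.
from typing import Callable, List, Tuple

def rerank(prompt: str,
           candidate_images: List[str],
           score_fn: Callable[[str, str], float]) -> Tuple[str, float]:
    """Return the image path with the highest score for the given prompt."""
    scored = [(image, score_fn(prompt, image)) for image in candidate_images]
    return max(scored, key=lambda pair: pair[1])

if __name__ == "__main__":
    # Dummy scorer for demonstration; replace with a real image-text scorer.
    dummy_score = lambda prompt, image: float(len(image))
    best_image, best_score = rerank("a red bicycle", ["img_0.png", "img_1.png"], dummy_score)
    print(best_image, best_score)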

Requirements

To translate the image descriptions and request GPT-4o evaluations via OpenAI's API, you will need Python 3.8+ and the following libraries:

  • openai
  • tqdm
  • tiktoken

Installation command:

pip install openai tqdm tiktoken
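
For reference, a minimal sketch of a translation request with the openai Python client (v1.x interface) might look like the following. The model name, prompt wording, and target language are assumptions rather than the exact settings used in Step1_Translate, and an OPENAI_API_KEY must be set in the environment.

# Sketch of translating an image description via OpenAI's API.
# Model name and prompt wording are assumptions; see Step1_Translate for the
# actual configuration used in the paper.
from openai import OpenAI

client = OpenAI()

def translate(caption: str, target_language: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": f"Translate the user's text into {target_language}. Return only the translation."},
            {"role": "user", "content": caption},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

if __name__ == "__main__":
    print(translate("A red bicycle leaning against a brick wall.", "German"))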

For the other environments used for inference and evaluation, please see the requirement files in Requirements.

Citation

If you find this repository or our results helpful for your research, please consider citing our paper:

@article{mu2025boosting,
  title={Boosting Text-To-Image Generation via Multilingual Prompting in Large Multimodal Models},
  author={Mu, Yongyu and Li, Hengyu and Wang, Junxin and Zhou, Xiaoxuan and Wang, Chenglong and Luo, Yingfeng and He, Qiaozhi and Xiao, Tong and Chen, Guocheng and Zhu, Jingbo},
  journal={arXiv preprint arXiv:2501.07086},
  year={2025}
}

Acknowledgments

Thanks to everyone who contributed to this project, and special thanks to the ICASSP reviewers and our funding agencies.

Contact

For any inquiries, don't hesitate to open an issue or contact lixiaoyumu9 [at] gmail [dot] com 🤗.
