
Awesome-Multilingual-LLM

This repository supports the growing interest in developing large language models (LLMs) that serve not only English speakers but also speakers of the world's 6,500+ other languages. It aims to help researchers discover relevant literature in this field, and collects core training and evaluation datasets, multilingual-capable LLMs, and associated scholarly articles.

Table of Contents

  1. In-Context Learning and Prompting Strategies
  2. Performance and Capabilities in Specific Languages
  3. Challenges and Limitations in Multilingual LLMs
  4. Multilingual LLMs in Programming and Code
  5. Comparative Studies and Benchmarks
  6. Datasets And Benchmarks
  7. Translation and Language Understanding
  8. Instruction Tuning
  9. Safety
  10. Miscellaneous Studies and Surveys

In-Context Learning and Prompting Strategies
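
A minimal sketch of the pattern studied in this line of work: labeled demonstrations, possibly in several languages, prepended to an unlabeled query in the target language. The task, template, and examples below are illustrative placeholders, not drawn from any specific paper in this list.

```python
# Sketch: assembling a few-shot prompt for cross-lingual sentiment
# classification. Demonstrations may be in different languages than
# the query, as in typical cross-lingual in-context learning setups.

def build_few_shot_prompt(examples, query,
                          task="Classify the sentiment as positive or negative."):
    """Concatenate labeled demonstrations ahead of the unlabeled query."""
    lines = [task, ""]
    for text, label in examples:
        lines.append(f"Text: {text}")
        lines.append(f"Label: {label}")
        lines.append("")
    lines.append(f"Text: {query}")
    lines.append("Label:")
    return "\n".join(lines)

demos = [
    ("The movie was wonderful.", "positive"),   # English
    ("La película fue terrible.", "negative"),  # Spanish
    ("Der Film war großartig.", "positive"),    # German
]
prompt = build_few_shot_prompt(demos, "Le film était ennuyeux.")  # French query
print(prompt)
```

The design choice studied by many of these papers is exactly which demonstrations to include (same language as the query, high-resource pivot language, or a mix) and how that affects target-language accuracy.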

Performance and Capabilities in Specific Languages

  • Holmström et al.: Explores the performance of English and multilingual LLMs in Swedish.

Challenges and Limitations in Multilingual LLMs

  • Zhu et al.: Investigates the advantages and challenges in multilingual machine translation using LLMs.

Multilingual LLMs in Programming and Code

  • Joshi et al.: Introduces RING, a multilingual repair engine powered by a language model trained on code.

Comparative Studies and Benchmarks

  • Scao et al.: Discusses the development and evaluation of BLOOM, a 176B-parameter open-access multilingual language model.
  • Lai et al.: Evaluates ChatGPT and other LLMs on multilingual NLP tasks.

Datasets And Benchmarks

  • Nguyen et al. - Introduces CulturaX, a 6.3-trillion-token multilingual dataset covering 167 languages for training LLMs, emphasizing quality through careful data cleaning and deduplication (2023).
  • Ladhak et al. - Introduces WikiLingua, a benchmark dataset for cross-lingual abstractive summarization in 18 languages, built from WikiHow (2020).
  • Gupta and Srikumar - Presents X-Fact, a multilingual dataset for factual verification in 25 languages, labeled for veracity by expert fact-checkers (2021).
  • Barrière et al. - Introduces a dataset of online debates in English for multilingual stance classification related to the European Green Deal (2022).
  • Wang et al. - Proposes a dataset for evaluating safeguards in LLMs and trains classifiers that achieve results similar to GPT-4 (2023).
  • Laperriere et al. - Updates the French MEDIA SLU dataset for spoken language understanding, integrated into the SpeechBrain toolkit (2022).
  • Hu et al. - Introduces Multi3WOZ, a dataset for training and evaluating multilingual and cross-lingual task-oriented dialog systems (2023).
  • [2023] Presents MEGA, benchmarking generative LLMs across 70 languages and comparing them to non-autoregressive models.
  • [2023] Investigates LLM-based evaluators for multilingual evaluation, highlighting bias and the need for calibration.
  • [2023] Proposes SEAHORSE, a dataset for evaluating multilingual, multifaceted summarization systems.
  • [2023] Proposes MINT, a multilingual textual intimacy dataset with tweets in 10 languages.
  • [2023] Evaluates ChatGPT on 37 languages across 7 tasks, revealing a performance gap compared to previous state-of-the-art models.
  • [2023] Proposes Eva-KELLM, a benchmark for evaluating knowledge editing in LLMs, with a focus on cross-lingual knowledge transfer.
  • [2023] Discusses ComMA, a multilingual dataset annotated for different types of aggression and bias in four languages.
  • [2023] Introduces the GINCO training dataset for automatic genre identification of web documents.
  • Joshi et al. - Introduces RING, a multilingual repair engine powered by a large language model trained on code, for program repair across languages (2022).
  • [2022] Introduces MEE, a Multilingual Event Extraction dataset with over 50K event mentions in 8 languages.
  • Bang et al. - Presents a framework for evaluating interactive LLMs using a newly designed multimodal dataset (2023).
  • [2022] BigScience: Social Construction of a Multilingual Large Language Model: Discusses BigScience, the collaborative project that created a multilingual dataset and trained BLOOM, a multilingual LLM.
  • Zhu et al. - Proposes the CoST dataset, with parallel data from 7 programming languages, for code snippet translation (2022).
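
Several dataset papers above (the CulturaX entry in particular) emphasize data cleaning and deduplication. As a minimal sketch, exact-match deduplication by content hash is shown below; this is a simplified stand-in for the heavier pipelines such corpora actually describe (URL-level filtering and near-duplicate MinHash matching), and the whitespace normalization rule is illustrative.

```python
import hashlib

def dedup(documents):
    """Keep the first occurrence of each distinct document,
    comparing on whitespace-normalized text via SHA-256."""
    seen = set()
    unique = []
    for doc in documents:
        normalized = " ".join(doc.split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

docs = [
    "Bonjour le monde.",
    "Bonjour   le monde.",  # duplicate after whitespace normalization
    "Hello, world.",
]
print(dedup(docs))  # the near-duplicate second document is dropped
```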

Translation and Language Understanding

  • Guerreiro et al.: Provides insights into the presence of hallucinations in multilingual translation models.
  • Li et al.: Discusses the translation abilities of large language models in multilingual contexts.

Instruction Tuning

Papers applying instruction tuning to fine-tune large language models (LLMs) for multilingual use cases:
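
A minimal sketch of the instruction/input/output record format (Alpaca-style JSON) that multilingual instruction-tuning datasets commonly use. The field names, prompt template, and example below are illustrative, not taken from a specific dataset in this list.

```python
import json

# Hypothetical record: a translation instruction with optional
# language-pair metadata that some multilingual datasets attach.
record = {
    "instruction": "Translate the sentence into German.",
    "input": "The weather is nice today.",
    "output": "Das Wetter ist heute schön.",
    "language_pair": "en-de",
}

def to_training_text(rec):
    """Flatten a record into a single supervised fine-tuning string."""
    return (
        f"### Instruction:\n{rec['instruction']}\n"
        f"### Input:\n{rec['input']}\n"
        f"### Response:\n{rec['output']}"
    )

print(json.dumps(record, ensure_ascii=False))
print(to_training_text(record))
```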

Safety

Miscellaneous Studies and Surveys

  • Pahune et al.: Emphasizes recent developments and efforts made for various kinds of LLMs, including multilingual language models.