Evaluating the Elementary Multilingual Capabilities of Large Language Models with MULTIQ

18 Jul 2024 | Carolin Holtermann, Paul Röttger, Timm Dill, Anne Lauscher
This paper evaluates the multilingual capabilities of state-of-the-art open large language models (LLMs) beyond their intended use, focusing on language fidelity (responding in the language of the prompt) and question-answering accuracy. The authors introduce MULTIQ, a new benchmark comprising 27,400 test questions across 137 typologically diverse languages. They find that all tested models respond faithfully and/or accurately in at least some languages beyond their intended use, and that models tend to answer more accurately when they also respond faithfully. However, there are substantial differences among models, and a long tail of languages in which models are neither accurate nor faithful. The study also explores the impact of tokenization strategies on multilingual performance, suggesting that subword encoding outperforms character and ASCII encodings. The findings highlight the need for further research to improve the multilingual capabilities of LLMs, especially for underrepresented languages.
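The tokenization contrast mentioned above can be illustrated with a toy example: a character-level scheme splits a word into one token per character, while a subword vocabulary keeps frequent pieces intact, yielding far fewer tokens per word. The vocabulary and the greedy longest-match segmenter below are illustrative assumptions for the sketch, not the tokenizers actually used by the models in the paper.

```python
# Toy sketch of subword vs. character tokenization "fertility"
# (tokens per word). The vocabulary and segmenter are hypothetical.

def char_tokenize(word):
    """Character-level encoding: one token per character."""
    return list(word)

def subword_tokenize(word, vocab):
    """Greedy longest-match segmentation against a subword vocabulary."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest substring starting at position i first.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:  # fall back to a single char
                tokens.append(piece)
                i = j
                break
    return tokens

vocab = {"multi", "lingual", "token", "ization"}

print(char_tokenize("multilingual"))           # 12 single-character tokens
print(subword_tokenize("multilingual", vocab)) # ['multi', 'lingual']
```

Languages whose words rarely match the subword vocabulary degrade toward the character-level case, which is one intuition for why tokenization choices matter for underrepresented languages.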