Evaluating the Elementary Multilingual Capabilities of Large Language Models with MULTIQ

18 Jul 2024 | Carolin Holtermann, Paul Röttger, Timm Dill, Anne Lauscher
This paper introduces MULTIQ, a new benchmark for evaluating the basic multilingual capabilities of open-source large language models (LLMs). The benchmark consists of 27,400 test questions across 137 typologically diverse languages, covering a wide range of topics, and assesses two key aspects of multilingual capability: language fidelity (whether models respond in the prompted language) and question-answering accuracy (whether models give correct answers). MULTIQ is designed as a silver standard: it is not perfect, but it provides reliable insights into basic multilingual capabilities.

The study evaluates six popular open-source LLMs, including Llama2, Mistral, Mixtral, and Qwen. All models respond faithfully and/or accurately in at least some languages beyond their intended use, but performance differs substantially across models: Mistral and Mixtral show higher language fidelity, while Qwen shows higher accuracy. Many models nonetheless perform poorly in languages outside their intended use.

The study also examines the impact of tokenization on multilingual capability. Models that tokenize a language into subwords perform better than those that fall back to individual characters or ASCII tokens, suggesting that tokenization strategy plays a significant role in multilingual performance.
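To make the tokenization finding concrete, the sketch below computes a simple tokens-per-character ratio ("fertility") for one model's tokenizer across a few languages. This is not the paper's analysis pipeline; the Hugging Face checkpoint name, the sample sentences, and the metric itself are illustrative assumptions.

```python
# A minimal sketch, assuming the Hugging Face `transformers` library.
# The checkpoint and sentences are illustrative, not from the paper.
from transformers import AutoTokenizer

# Any public checkpoint works here; Mistral-7B is used as an example.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

samples = {
    "English": "Where is the nearest train station?",
    "German": "Wo ist der nächste Bahnhof?",
    "Russian": "Где ближайший вокзал?",
}

for lang, text in samples.items():
    tokens = tokenizer.tokenize(text)
    # Tokens per character: ratios near (or above) 1 suggest character- or
    # byte-level fallback; ratios well below 1 suggest real subword coverage.
    fertility = len(tokens) / len(text)
    print(f"{lang}: {len(tokens)} tokens, {fertility:.2f} tokens/char")
```

Comparing this ratio across languages for a given tokenizer gives a quick, rough indication of which languages it represents with genuine subwords rather than character-level fragments.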
Finally, the study finds that models responding in the same language as the prompt tend to be more accurate than those responding in English, although this does not always hold: some models respond in yet other languages and still perform well. The authors conclude that improving the multilingual capabilities of open-source LLMs is crucial for ensuring that language technologies benefit everyone, regardless of the language they speak, and that further research into multilingual capabilities, particularly for underrepresented languages, is needed.
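As a concrete illustration of the language-fidelity metric, here is a minimal sketch of how one might check whether a response is written in the prompted language. The paper's actual fidelity classifier is not described here, so this sketch uses the off-the-shelf langdetect library as a stand-in; the function name and examples are hypothetical.

```python
# A minimal sketch, using `langdetect` (pip install langdetect) as a
# stand-in language identifier; the paper's setup may differ.
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # make langdetect's detection deterministic

def is_faithful(prompt_lang: str, response: str) -> bool:
    """Return True if the response appears to be in the prompted language.

    `prompt_lang` is an ISO 639-1 code such as "de" or "ru".
    """
    try:
        return detect(response) == prompt_lang
    except Exception:  # very short or ambiguous responses can fail detection
        return False

# Illustrative usage: a German prompt answered in English is unfaithful.
print(is_faithful("de", "The nearest station is two blocks away."))    # False
print(is_faithful("de", "Der nächste Bahnhof ist zwei Straßen weiter."))  # True
```

Aggregating this boolean over a model's responses per language yields a fidelity score of the kind the benchmark reports.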