28 Jun 2024 | Kelly Marchisio*, Wei-Yin Ko*, Alexandre Bérard, Théo Dehaze, Sebastian Ruder*
This paper investigates a significant limitation of large language models (LLMs): their inability to consistently generate text in the user's desired language, a failure the authors call "language confusion." To address this issue, the authors introduce the Language Confusion Benchmark (LCB), which evaluates LLMs across 15 typologically diverse languages using both existing and newly created prompts. The benchmark assesses monolingual and cross-lingual generation and reveals that models such as Llama Instruct and Mistral exhibit severe language confusion; even the strongest models fail to consistently respond in the correct language. Base and English-centric models are more prone to language confusion, which is exacerbated by complex prompts and high sampling temperatures.

The study finds that language confusion can be partially mitigated through few-shot prompting, multilingual SFT, and preference tuning, and the LCB serves as a first layer of efficient, scalable multilingual evaluation. The paper also examines how the dataset, prompt length, instruction position, and quantization affect language confusion, and it proposes mitigations such as lowering the sampling temperature and nucleus size, using few-shot prompting, and applying multilingual instruction tuning. The findings highlight the importance of addressing language confusion to ensure equal utility of LLMs across languages.
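At its core, the benchmark checks whether each line of a model's response is in the language the user asked for. Below is a minimal sketch of such a line-level check, assuming an off-the-shelf fastText language-identification model (lid.176.bin) and a simplified pass criterion; the function names, preprocessing, and pass/fail rule here are illustrative assumptions, not the authors' released evaluation code.

```python
# Minimal sketch of line-level language-confusion detection (not the authors' code).
# Assumes the fastText LID model lid.176.bin has been downloaded locally.
import fasttext

LID_MODEL_PATH = "lid.176.bin"  # hypothetical local path to the LID model
lid = fasttext.load_model(LID_MODEL_PATH)

def line_language(line: str) -> str:
    """Return the language code fastText predicts for a single line."""
    labels, _ = lid.predict(line.replace("\n", " "))
    return labels[0].replace("__label__", "")

def passes_line_level(response: str, desired_lang: str) -> bool:
    """A response passes if every non-empty line is detected as the desired language."""
    lines = [l.strip() for l in response.splitlines() if l.strip()]
    return all(line_language(l) == desired_lang for l in lines)

def line_pass_rate(responses: list[str], desired_lang: str) -> float:
    """Fraction of responses with no line-level language confusion."""
    return sum(passes_line_level(r, desired_lang) for r in responses) / len(responses)

# Example: a Spanish prompt answered partly in English counts as confused.
print(line_pass_rate(["Hola, ¿cómo estás?\nSure, here is the answer."], "es"))  # 0.0
```

Aggregating this pass/fail signal over a prompt set gives a cheap, scalable first-pass multilingual check of the kind the LCB is meant to provide, before any deeper quality evaluation.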