28 Jun 2024 | Kelly Marchisio*, Wei-Yin Ko*, Alexandre Bérard, Théo Dehaze, Sebastian Ruder*
The paper investigates a significant limitation of large language models (LLMs): their inability to consistently generate text in the user's desired language, a phenomenon referred to as "language confusion." The authors create the Language Confusion Benchmark (LCB) to evaluate this issue, covering 15 typologically diverse languages with existing and newly created English and multilingual prompts. They evaluate various LLMs on monolingual and cross-lingual generation tasks, finding that Llama Instruct and Mistral models exhibit high degrees of language confusion, and that even the strongest models fail to respond consistently in the correct language. Base and English-centric models are particularly prone to this issue, which is exacerbated by complex prompts and high sampling temperatures. The paper proposes methods to mitigate language confusion, including few-shot prompting, multilingual fine-tuning, and preference tuning. The LCB serves as an efficient and scalable tool for evaluating multilingual performance in LLMs. The study highlights the need for more equitable utility across languages and provides insights into the factors contributing to language confusion.
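For intuition, the benchmark's core evaluation can be thought of as a line-level check: a response "passes" for a given prompt if every line of the output is identified as the requested language. The snippet below is a minimal sketch of such a check, not the authors' implementation; it assumes the `langdetect` package for language identification (the paper relies on a fastText language identifier), and the length threshold and toy examples are purely illustrative.

```python
# Minimal sketch of a line-level "language confusion" check.
# Assumption: the `langdetect` package (pip install langdetect) stands in for
# the fastText language identifier used in the paper.
from langdetect import detect, DetectorFactory
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # make langdetect deterministic across runs


def line_level_pass(response: str, target_lang: str) -> bool:
    """Return True if every sufficiently long line of `response` is detected
    as `target_lang` (an ISO 639-1 code such as 'fr')."""
    for line in response.splitlines():
        line = line.strip()
        if len(line) < 20:  # skip very short lines; detection is unreliable there
            continue
        try:
            if detect(line) != target_lang:
                return False  # at least one line is in the wrong language
        except LangDetectException:
            continue  # line has no usable features (e.g. only punctuation)
    return True


def line_pass_rate(responses: list[str], target_lang: str) -> float:
    """Fraction of responses that show no line-level language confusion."""
    if not responses:
        return 0.0
    return sum(line_level_pass(r, target_lang) for r in responses) / len(responses)


if __name__ == "__main__":
    # Toy outputs (not from the benchmark): one fully French, one that
    # drifts into English mid-response.
    outputs = [
        "Bonjour, voici une réponse entièrement en français pour votre question.",
        "Voici le début de la réponse en français.\n"
        "However, the model suddenly switches to English on this line.",
    ]
    print(f"Line-level pass rate (fr): {line_pass_rate(outputs, 'fr'):.2f}")
```

In this sketch, the second toy response fails the check because one of its lines is detected as English, so the reported pass rate is 0.5; the benchmark aggregates this kind of per-response judgment across prompts and languages.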