2 Jul 2024 | John Dang, Arash Ahmadian, Kelly Marchisio, Julia Kreutzer, Ahmet Üstün, Sara Hooker
This paper addresses the challenge of multilingual preference optimization for large language models (LLMs). Despite the widespread adoption of preference optimization techniques, most research has focused on a small set of languages such as English and Chinese, neglecting the vast majority of the world's languages. The authors conduct an extensive study to achieve state-of-the-art performance in aligning multilingual LLMs. They introduce a novel method for generating high-quality multilingual feedback data that balances data coverage across languages. The study highlights the benefits of cross-lingual transfer and of increased dataset size in preference training. The preference-trained model achieves significant improvements over state-of-the-art models, including a 54.4% win-rate against Aya 23 8B and win-rates of 69.5% or higher against widely used models such as Gemma-1.1-7B-it, Llama-3-8B-Instruct, and Mistral-7B-Instruct-v0.3. The research expands the frontier of alignment techniques to 23 languages, covering roughly half of the world's population. Key findings include:
1. **Cross-lingual Transfer**: Preference optimization with English data improves performance in other languages, and adding more languages significantly enhances cross-lingual transfer.
2. **Multilingual Data**: Increasing the number of languages in preference optimization training data consistently improves multilingual performance.
3. **Online vs Offline Optimization**: Online preference optimization (RLOO) outperforms offline optimization (DPO) in both overall performance and cross-lingual transfer (the two objectives are sketched after this list).
4. **Model Performance**: The preference-trained Aya 23 8B model outperforms both the original Aya 23 8B and widely used open-source models in various tasks.
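To make the contrast in finding 3 concrete, below is a minimal, illustrative PyTorch sketch of the two objective families: offline DPO, which fits a fixed dataset of (chosen, rejected) pairs against a frozen reference model, and online RLOO, which scores fresh on-policy samples and uses a leave-one-out baseline. Per-completion log-probabilities and scalar rewards are assumed to be computed elsewhere; this is not the paper's training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Offline DPO: push the policy to prefer the chosen completion over
    the rejected one, measured relative to a frozen reference model."""
    logits = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    return -F.logsigmoid(logits).mean()

def rloo_loss(sample_logps, rewards):
    """Online RLOO: for k on-policy samples per prompt, use the mean reward
    of the other k-1 samples as a per-sample (leave-one-out) baseline."""
    # sample_logps, rewards: shape (batch, k)
    k = rewards.shape[1]
    baseline = (rewards.sum(dim=1, keepdim=True) - rewards) / (k - 1)
    advantages = rewards - baseline
    # REINFORCE-style objective; detach advantages so only log-probs receive gradients
    return -(advantages.detach() * sample_logps).mean()
```

The practical difference is that RLOO must generate and reward-score new samples during training, whereas DPO only needs the pre-collected preference pairs.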
The study also discusses the limitations of the approach, such as the potential for cultural biases in synthetic datasets and the need for further exploration of larger models and more diverse languages.
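For a concrete picture of what multilingual feedback data can look like, the following is a hypothetical sketch of one common recipe, not the paper's exact pipeline: translate English prompts into each target language and pair a stronger model's completion (chosen) with a weaker model's (rejected). The helpers `translate`, `generate_strong`, and `generate_weak` are placeholders supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PreferencePair:
    prompt: str        # prompt in the target language
    chosen: str        # preferred completion
    rejected: str      # dispreferred completion
    language: str

def build_multilingual_pairs(
    english_prompts: List[str],
    languages: List[str],
    translate: Callable[[str, str], str],   # (text, target_lang) -> translation
    generate_strong: Callable[[str], str],  # higher-quality model
    generate_weak: Callable[[str], str],    # lower-quality model
) -> List[PreferencePair]:
    """Illustrative recipe: translate prompts, then label the stronger
    model's completion as chosen and the weaker model's as rejected."""
    pairs = []
    for lang in languages:
        for prompt in english_prompts:
            translated = translate(prompt, lang)
            pairs.append(PreferencePair(
                prompt=translated,
                chosen=generate_strong(translated),
                rejected=generate_weak(translated),
                language=lang,
            ))
    return pairs
```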