23 Jun 2024 | Xiaochen Li*, Zheng-Xin Yong*, Stephen H. Bach
This paper explores the zero-shot cross-lingual generalization of preference tuning for detoxifying multilingual Large Language Models (LLMs). Unlike previous work, which suggests limited cross-lingual generalization for safety tasks, the authors demonstrate that Direct Preference Optimization (DPO) training with only English data can significantly reduce toxicity in multilingual open-ended generations across 17 different languages. The probability of mGPT-1.3B generating toxic continuations drops from 46.8% to 3.9% after training. This effect generalizes to other multilingual LLMs such as BLOOM, Llama3, and Aya-23. Using mechanistic interpretability tools, the authors discover the dual multilinguality property of MLP layers in LLMs, explaining the cross-lingual generalization of DPO. They also show that bilingual sentence retrieval accuracy strongly correlates with the cross-lingual transferability of DPO preference tuning. The findings highlight the importance of multilingual toxicity evaluation and mitigation in LLMs, and provide insights into the mechanisms behind cross-lingual generalization.
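For readers unfamiliar with the technique, a minimal sketch of the DPO objective used in this line of work may help. This is not the authors' implementation, just the standard per-pair DPO loss: the policy is trained to prefer a chosen (here, non-toxic) continuation over a rejected (toxic) one, relative to a frozen reference model. All function and variable names below are illustrative.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    Each argument is the summed log-probability of the chosen (non-toxic)
    or rejected (toxic) continuation under the trainable policy or the
    frozen reference model. beta controls how far the policy may drift
    from the reference.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log sigmoid(margin): the loss shrinks as the policy prefers the
    # chosen continuation more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy and reference agree exactly, the margin is zero and the loss is `log 2`; training pushes the margin positive, which in this paper's setting means suppressing toxic continuations even in languages absent from the English-only preference data.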