Preference Tuning For Toxicity Mitigation Generalizes Across Languages

23 Jun 2024 | Xiaochen Li, Zheng-Xin Yong, Stephen H. Bach
This paper studies the cross-lingual generalization of preference tuning for mitigating toxicity in large language models (LLMs). It demonstrates that training LLMs on English preference data with Direct Preference Optimization (DPO) substantially reduces toxicity in multilingual open-ended generation: for example, the probability that mGPT-1.3B generates a toxic continuation drops from 46.8% to 3.9% across 17 languages after training. The result generalizes to other multilingual LLMs, including BLOOM, Llama3, and Aya-23.

To explain why English-only DPO transfers, the study examines the models' MLP layers and identifies a dual multilinguality property that accounts for the cross-lingual generalization. Value vectors in the MLPs encode multilingual toxic concepts, and when they are activated by their key vectors they promote tokens associated with those concepts in multiple languages. Moreover, the same set of key vectors consistently responds to, and is activated by, toxic prompts across languages. After DPO training, the activations produced by these key vectors are effectively suppressed.
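For reference, the preference tuning described above uses the standard DPO objective. The following is a minimal, generic sketch of that loss (not the authors' training code; the beta value and batching scheme are assumptions):

```python
# Generic DPO loss over paired (non-toxic, toxic) continuations.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO pushes the policy to prefer the non-toxic (chosen) continuation
    over the toxic (rejected) one, relative to a frozen reference model.
    Inputs are summed continuation log-probabilities of shape (B,)."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

One plausible way to surface the dual multilinguality of value vectors is a logit-lens style reading: project each MLP value vector onto the unembedding matrix and inspect the tokens it promotes. The sketch below is a hypothetical illustration, assuming a GPT-2-style checkpoint from Hugging Face transformers (e.g. "ai-forever/mGPT"); the module layout and the example layer/neuron indices are assumptions, not taken from the paper:

```python
# Hypothetical inspection of which tokens a single MLP value vector promotes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ai-forever/mGPT"  # assumption: a GPT-2-style multilingual LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

unembed = model.get_output_embeddings().weight            # (vocab_size, d_model)

def top_promoted_tokens(layer: int, neuron: int, k: int = 10):
    """Tokens whose logits a single MLP value vector pushes up the most."""
    # In GPT-2-style blocks, mlp.c_proj.weight has shape (d_ff, d_model),
    # so each row is one neuron's value vector.
    value_vec = model.transformer.h[layer].mlp.c_proj.weight[neuron]  # (d_model,)
    scores = unembed @ value_vec                           # (vocab_size,)
    top = torch.topk(scores, k).indices.tolist()
    return [tok.decode([i]) for i in top]

# Example (arbitrary indices): a value vector that encodes a toxic concept
# tends to promote toxic tokens from several languages at once.
print(top_promoted_tokens(layer=20, neuron=1234))
```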
The study also finds that bilingual sentence retrieval predicts the cross-lingual transferability of DPO preference tuning. How well detoxification transfers from English to a language X depends on how closely English and X representations align within the multilingual toxic subspace, and this alignment is reflected in bilingual sentence retrieval accuracy, which the authors use as a measure of the quality of the model's language-independent representations.

In summary, safety preference tuning with DPO generalizes across languages in a zero-shot manner, and the finding is robust across different multilingual LLMs. The dual multilinguality of toxic neurons provides a mechanistic explanation for this generalization, and because generalization relies on shared multilingual representations, bilingual sentence retrieval accuracy predicts how well English safety preference tuning will transfer to other languages.
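The retrieval metric mentioned above can be illustrated with a short sketch: encode parallel English / language-X sentences with the same LM, retrieve each English sentence's nearest neighbor among the X sentences by cosine similarity, and report the fraction retrieved correctly. The mean-pooled middle-layer representation and the model name are assumptions, not necessarily the paper's exact setup:

```python
# Hedged sketch of bilingual sentence retrieval accuracy.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "ai-forever/mGPT"        # assumption: same multilingual LM as above
tok = AutoTokenizer.from_pretrained(model_name)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token     # GPT-style tokenizers often lack a pad token
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
model.eval()

@torch.no_grad()
def embed(sentences, layer: int = 12):
    """Mean-pool the hidden states of one layer as a sentence representation."""
    batch = tok(sentences, return_tensors="pt", padding=True, truncation=True)
    hidden = model(**batch).hidden_states[layer]           # (B, T, d)
    mask = batch["attention_mask"].unsqueeze(-1)           # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)            # (B, d)

def retrieval_accuracy(english_sents, target_sents, layer: int = 12) -> float:
    """Fraction of English sentences whose nearest target-language neighbor
    (by cosine similarity) is the correct parallel translation."""
    en = F.normalize(embed(english_sents, layer), dim=-1)
    xx = F.normalize(embed(target_sents, layer), dim=-1)
    preds = (en @ xx.T).argmax(dim=-1)
    gold = torch.arange(len(english_sents))
    return (preds == gold).float().mean().item()

# Toy usage with two parallel pairs (real evaluations use a parallel corpus):
en = ["The weather is nice today.", "I like reading books."]
es = ["Hoy hace buen tiempo.", "Me gusta leer libros."]
print(retrieval_accuracy(en, es))
```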