July 9, 2024 | Aakanksha*, Arash Ahmadian, Beyza Ermis, Seraphina Goldfarb-Tarrant, Julia Kreutzer, Marzieh Fadaee*, Sara Hooker*
The paper introduces the Aya Red-teaming dataset, a multilingual benchmark for evaluating AI safety and alignment across diverse languages and cultural contexts. The dataset consists of human-annotated prompts that distinguish between global and local harms, enabling the assessment of alignment techniques under non-stationary preference distributions. The authors explore several alignment methods, including supervised fine-tuning (SFT) and direct preference optimization (DPO), to balance safety and general performance. They find that DPO applied on top of SFT, denoted DPO(SFT), substantially reduces harmful outputs while preserving general capabilities, achieving a 54.7% decrease in harmful generations and a 71% win-rate against the base model. The study highlights the importance of cross-lingual transfer and the need for language-specific datasets that capture cultural nuances in AI safety. The results show that global harms are generally easier to mitigate than local harms, but training on both types of harms yields the best overall performance. The paper also finds that LLM-based evaluators can exhibit biases rooted in their training data, while human evaluations align closely with LLM-based judgments. The work emphasizes the critical role of multilingual safety alignment in ensuring AI systems are safe and effective across diverse languages and cultures, and the authors conclude that, with appropriate alignment techniques and datasets, models can achieve both strong general capabilities and safety.
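For context on the preference-optimization step, DPO trains the policy directly on preference pairs without a separate reward model. The sketch below is a minimal, illustrative PyTorch version of the standard DPO objective, not the paper's implementation; the function name and the `beta` default are placeholders, and in the DPO(SFT) setup described above the frozen reference model would be the SFT checkpoint.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Illustrative DPO objective (Rafailov et al., 2023) -- not the paper's code.

    Each argument is a tensor of summed log-probabilities of the chosen
    (preferred, e.g. safer) and rejected (dispreferred, e.g. harmful)
    completions under the policy being trained and a frozen reference
    model (the SFT checkpoint in a DPO(SFT)-style setup).
    """
    # Implicit rewards: log-probability ratios scaled by beta.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between preferred and dispreferred responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```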