22 Jul 2024 | Mikayel Samvelyan*, Sharath Chandra Raparthy*, Andrei Lupu*, Eric Hambro, Aram H. Markosyan, Manish Bhatt, Yuning Mao, Minqi Jiang, Jack Parker-Holder, Jakob Foerster, Tim Rocktäschel, Roberta Raileanu
**RAINBOW TEAMING: Open-Ended Generation of Diverse Adversarial Prompts**
**Authors:** Mikayel Samvelyan, Sharath Chandra Raparthy, Andrei Lupu, Eric Hambro, Aram H. Markosyan, Manish Bhatt, Yuning Mao, Minqi Jiang, Jack Parker-Holder, Jakob Foerster, Tim Rocktäschel, Roberta Raileanu
**Institution:** Meta, University College London, University of Oxford
**Date:** July 23, 2024
**Abstract:**
As large language models (LLMs) become increasingly prevalent, understanding and enhancing their robustness to adversarial attacks is crucial. Existing methods for identifying adversarial prompts often focus on specific domains, lack diversity, or require extensive human annotations. To address these limitations, the authors present RAINBOW TEAMING, a novel black-box approach for generating diverse adversarial prompts. RAINBOW TEAMING casts the problem as a quality-diversity (QD) search, using open-ended search to generate both effective and diverse prompts. The method is applied to the safety domain, targeting state-of-the-art LLMs like Llama 2 and Llama 3. The approach reveals hundreds of effective adversarial prompts with an attack success rate exceeding 90% across all tested models. Additionally, fine-tuning LLMs with synthetic data generated by RAINBOW TEAMING significantly enhances their safety without compromising general performance or helpfulness. The versatility of RAINBOW TEAMING is demonstrated by its application to question answering and cybersecurity, showcasing its potential for robust open-ended self-improvement in various applications.
**Key Contributions:**
- **RAINBOW TEAMING:** A novel black-box approach for generating diverse adversarial prompts using QD search.
- **Effectiveness:** Achieves high attack success rates (over 90%) across multiple LLMs.
- **Transferability:** Synthetic data generated by RAINBOW TEAMING significantly improves LLMs' adversarial robustness.
- **Versatility:** Applied to safety, question answering, and cybersecurity, demonstrating broad applicability.
**Methods:**
- **Quality-Diversity Search:** RAINBOW TEAMING uses a QD framework to optimize for both the quality and diversity of adversarial prompts.
- **MAP-Elites:** The core algorithm for evolving and archiving adversarial prompts, ensuring a comprehensive exploration of the solution space.
- **Mutation Operator:** Perturbs existing prompts to generate new candidates, promoting diversity and adaptability.
- **Preference Model:** A judge LLM compares each candidate prompt against the incumbent in its archive cell, keeping the more effective one.
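The interplay of these components can be sketched as a minimal MAP-Elites loop. This is a hypothetical illustration, not the authors' implementation: the descriptor sets, `mutate`, and `judge_prefers` are stand-ins for the paper's LLM-based mutation operator and judge LLM.

```python
import random

# Hypothetical feature descriptors spanning the archive;
# the paper uses LLM-derived dimensions such as risk category and attack style.
RISK_CATEGORIES = ["violence", "fraud", "misinformation"]
ATTACK_STYLES = ["role_play", "hypothetical", "direct"]

def mutate(prompt, category, style):
    # Stand-in for the LLM mutation operator: rewrites the parent prompt
    # toward the target risk category and attack style.
    return f"[{category}/{style}] {prompt}"

def judge_prefers(candidate, incumbent):
    # Stand-in for the judge LLM's pairwise preference: returns True if
    # the candidate elicits the less safe response from the target model.
    return random.random() < 0.5

def rainbow_teaming(seed_prompts, iterations=1000):
    # Archive maps each (category, style) cell to its best prompt so far.
    archive = {}
    for _ in range(iterations):
        # Sample a parent: an existing elite, or a seed if the archive is empty.
        parent = random.choice(list(archive.values()) or seed_prompts)
        # Sample a target cell and mutate the parent toward it.
        cat = random.choice(RISK_CATEGORIES)
        style = random.choice(ATTACK_STYLES)
        candidate = mutate(parent, cat, style)
        incumbent = archive.get((cat, style))
        # Replace the incumbent only if the judge prefers the candidate.
        if incumbent is None or judge_prefers(candidate, incumbent):
            archive[(cat, style)] = candidate
    return archive

archive = rainbow_teaming(["example seed prompt"], iterations=50)
print(len(archive))  # number of filled archive cells, at most 3 * 3 = 9
```

Over many iterations the archive fills with one elite prompt per cell, which is what yields a set of attacks that is both effective (each cell's winner) and diverse (coverage across descriptor combinations).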
**Results:**
- **Safety Domain:** High attack success rates (92% for Llama 2, 98% for Mistral 7B) across multiple models.
- **Transferability:** Fine-tuning on synthetic data generated by RAINBOW TEAMING markedly improves model safety without compromising general performance or helpfulness.