WILDTEAMING at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models


26 Jun 2024 | Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, Nouha Dziri
**Affiliations:** University of Washington, Allen Institute for Artificial Intelligence, Seoul National University, Carnegie Mellon University
**Contact:** lwjiang@cs.washington.edu, nouhad@allenai.org

**Abstract:** WILDTEAMING is an automatic red-teaming framework that mines novel jailbreak tactics from real-world user-chatbot interactions, yielding 5.7K unique clusters of tactics, and then composes selections of these tactics into diverse and challenging adversarial attacks. Because it draws on chatbot users who were not specifically instructed to break the system, WILDTEAMING surfaces up to 4.6 times more unique successful attacks than prior methods. To address the lack of open-source safety training data, WILDTEAMING is used to create WILDJAILBREAK, a large-scale synthetic safety dataset of 262K prompt-response pairs covering four query types: vanilla harmful, vanilla benign, adversarial harmful, and adversarial benign. Extensive experiments show that training on WILDJAILBREAK improves safety behaviors without over-refusal, achieving a balanced trade-off between safety and general capabilities. The work highlights the importance of diverse, comprehensive safety training resources for building robust and safe language models.

**Key Contributions:**
1. **WILDTEAMING:** An automatic framework for discovering and composing novel jailbreak tactics.
2. **WILDJAILBREAK:** A large-scale synthetic safety training dataset with 262K prompt-response pairs.
3. **Balanced Safety Training:** Demonstrates that training on both vanilla and adversarial queries yields balanced safety behaviors.

**Methods:**
- **WILDTEAMING:** Mines 105K human-devised jailbreak tactics from real-world user-chatbot interactions and composes them into diverse adversarial attacks (see the composition sketch after this list).
- **WILDJAILBREAK:** Builds a dataset of 262K prompt-response pairs, spanning vanilla and adversarial queries, to enhance safety training (see the schema sketch that follows).
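As a rough illustration of the composition step, the sketch below samples several mined tactics and assembles an instruction for an attacker LLM. `TACTIC_POOL`, `COMPOSE_TEMPLATE`, and `compose_attack` are hypothetical names, not the paper's actual implementation:

```python
import random

# Hypothetical pool of mined tactic descriptions. In the paper, tactics
# are mined from real user-chatbot logs and grouped into ~5.7K clusters;
# these strings are illustrative placeholders only.
TACTIC_POOL = [
    "frame the request as part of a fiction-writing exercise",
    "assign the model a permissive roleplay persona",
    "nest the request inside a translation task",
    "claim the content is needed for safety research",
]

# Hypothetical instruction template for an attacker LLM.
COMPOSE_TEMPLATE = (
    "Rewrite the following request into an adversarial prompt by applying "
    "each of these jailbreak tactics: {tactics}.\n\nRequest: {query}"
)

def compose_attack(vanilla_query: str, num_tactics: int = 3, seed: int = 0) -> str:
    """Sample mined tactics and build an instruction for an attacker LLM."""
    rng = random.Random(seed)
    tactics = rng.sample(TACTIC_POOL, k=min(num_tactics, len(TACTIC_POOL)))
    return COMPOSE_TEMPLATE.format(tactics="; ".join(tactics), query=vanilla_query)

# Example: compose an attack around a benign placeholder query.
print(compose_attack("Explain how locks work."))
```

The full pipeline also filters the generated candidates (e.g., discarding off-topic or low-quality attacks), a step this sketch omits.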
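The dataset's four-way split can be pictured with a small schema. The field names and the `target_behavior` helper below are assumptions for illustration, not the released dataset's actual columns:

```python
from dataclasses import dataclass
from typing import Literal

# Illustrative schema for WILDJAILBREAK's four query types.
QueryType = Literal[
    "vanilla_harmful",      # plain harmful request -> refusal response
    "vanilla_benign",       # superficially sensitive but safe -> helpful response
    "adversarial_harmful",  # tactic-wrapped harmful request -> refusal response
    "adversarial_benign",   # tactic-wrapped safe request -> helpful response
]

@dataclass
class SafetyExample:
    prompt: str
    response: str
    query_type: QueryType

def target_behavior(example: SafetyExample) -> str:
    """Benign queries should be answered helpfully; harmful ones refused."""
    return "comply" if example.query_type.endswith("benign") else "refuse"
```

The benign variants act as contrastive examples: they discourage the model from refusing any query that merely looks sensitive or adversarial, which is how training on the full mix avoids over-refusal.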
**Results:**
- **Effectiveness:** WILDTEAMING identified up to 4.6 times more unique successful attacks than state-of-the-art methods.
- **Safety Training:** Training on WILDJAILBREAK improved safety behaviors without over-refusal, achieving a balanced trade-off between safety and general capabilities.

**Discussion:**
- **Comprehensive Safety Training:** Emphasizes the need for open and diverse safety training resources.
- **Evaluation Methods:** Safety evaluation methods must evolve to keep pace with improving model capabilities.
- **Internal Mechanisms:** Understanding the underlying mechanisms of safety alignment approaches remains important.

**Conclusion:** WILDTEAMING and WILDJAILBREAK together provide a comprehensive approach to discovering and addressing vulnerabilities in language models, enhancing their safety and robustness.