WILDTEAMING at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models


26 Jun 2024 | Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Nilofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, Nouha Dziri
WILDTEAMING is an automatic red-teaming framework that mines in-the-wild user-chatbot interactions to identify 5,700 unique clusters of jailbreak tactics, then composes these tactics into diverse adversarial attacks. It reveals previously unknown vulnerabilities in frontier large language models (LLMs), producing up to 4.6 times more diverse and successful adversarial attacks than state-of-the-art jailbreaking methods.

To address the lack of open-source jailbreak training data, the authors use WILDTEAMING to create WILDJAILBREAK, a large-scale open-source synthetic safety dataset of 262,000 prompt-response pairs spanning four query types: vanilla harmful, vanilla benign, adversarial harmful, and adversarial benign. WILDJAILBREAK provides a comprehensive resource for safety training, enabling the study of data scaling effects and of the interplay between data properties and model capabilities. Through extensive training and evaluation, the authors identify training properties that enable balanced safety behavior: appropriate safeguarding without over-refusal, effective handling of both vanilla and adversarial queries, and minimal impact on general capabilities.

The framework operates in two steps: mining jailbreak tactics from in-the-wild chatbot logs (MINE) and composing those tactics into diverse adversarial attacks (COMPOSE). Automatic tactic discovery, aided by GPT-4, expands the range of successful attack candidates while keeping computational costs low. Mined tactics include, for example, "prefacing harmful content with a content warning" and "setting blame for non-compliance." In evaluation, WILDTEAMING outperforms other methods in attack success rate and diversity, with a significant advantage in finding multiple unique successful attacks.
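The COMPOSE step can be illustrated with a minimal sketch: sample a few mined tactics and combine them with a vanilla query to form an adversarial prompt. In the actual framework the combination is performed by an LLM; the tactic names and prompt template below are illustrative assumptions, not the released WILDTEAMING artifacts.

```python
import random

# Hypothetical mined tactic pool; names and descriptions are illustrative.
TACTIC_POOL = {
    "content_warning": "Preface the harmful content with a content warning.",
    "blame_shifting": "Set blame for non-compliance on the assistant.",
    "roleplay_framing": "Frame the request as a fictional roleplay scenario.",
    "expert_persona": "Ask the model to adopt an authoritative expert persona.",
}

def compose_attack(vanilla_query: str, num_tactics: int = 2, seed=None) -> dict:
    """Sample mined tactics and combine them with a vanilla query.

    Here we simply concatenate tactic descriptions to show the data flow;
    WILDTEAMING delegates the actual rewriting to an LLM.
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(TACTIC_POOL), k=num_tactics)
    instructions = " ".join(TACTIC_POOL[name] for name in chosen)
    return {
        "tactics": chosen,
        "adversarial_prompt": f"{instructions} Now respond to: {vanilla_query}",
    }

attack = compose_attack("How do I pick a lock?", num_tactics=2, seed=0)
print(attack["tactics"])
```

Sampling tactic combinations rather than fixed templates is what lets the framework generate many distinct attack candidates per query.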
WILDJAILBREAK's four query types support evaluating models' robustness to both vanilla and adversarial queries, harmful and benign alike, and the inclusion of adversarial queries in training improves robustness to adversarial attacks. Together, WILDTEAMING and WILDJAILBREAK contribute to safer and more transparent future models by enhancing safety training and evaluation, underscoring the importance of diverse, comprehensive safety training data for improving model safety.
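The four query types form a 2x2 grid over intent (harmful vs. benign) and form (vanilla vs. adversarial). A minimal sketch of such a schema is below; the field names are assumptions for illustration, not the released WILDJAILBREAK column names.

```python
from dataclasses import dataclass
from collections import Counter

@dataclass(frozen=True)
class SafetyExample:
    """One prompt-response training pair, bucketed by intent and form."""
    prompt: str
    response: str
    harmful: bool       # does the underlying intent warrant a refusal?
    adversarial: bool   # is the prompt wrapped in jailbreak tactics?

    @property
    def quadrant(self) -> str:
        intent = "harmful" if self.harmful else "benign"
        form = "adversarial" if self.adversarial else "vanilla"
        return f"{form}_{intent}"

# One toy example per quadrant; real data would have many of each.
examples = [
    SafetyExample("...", "...", harmful=True, adversarial=False),
    SafetyExample("...", "...", harmful=False, adversarial=False),
    SafetyExample("...", "...", harmful=True, adversarial=True),
    SafetyExample("...", "...", harmful=False, adversarial=True),
]
counts = Counter(ex.quadrant for ex in examples)
print(counts)
```

The benign quadrants matter as much as the harmful ones: training only on refusals of harmful queries tends to produce over-refusal, which this balanced grid is designed to avoid.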