EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models

EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models

18 Mar 2024 | Weikang Zhou*, Xiao Wang**, Limao Xiong**, Han Xia**, Yingshuang Gu*, Mingxu Chai*, Fukang Zhu*, Caishuang Huang*, Shihan Dou*, Zhiheng Xi*, Rui Zheng*, Songyang Gao*, Yicheng Zou*, Hang Yan*, Yifan Le*, Ruohui Wang*, Lijun Li*, Jing Shao*, Tao Gui*, Qi Zhang*, Xuanjing Huang*
**EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models** This paper introduces EasyJailbreak, a unified framework designed to simplify the construction and evaluation of jailbreak attacks against Large Language Models (LLMs). The framework decomposes jailbreak methods into four components: Selector, Mutator, Constraint, and Evaluator. These components enable researchers to easily build and evaluate attacks by combining novel and existing components. EasyJailbreak supports 11 distinct jailbreak methods and has been validated across 10 LLMs, revealing a significant average breach probability of 60%. Notably, advanced models like GPT-3.5-Turbo and GPT-4 exhibit average Attack Success Rates (ASR) of 57% and 33%, respectively. The framework includes a web platform, PyPI package, screencast video, and experimental outputs, providing researchers with comprehensive resources for security validation. **Key Features:** - **Standardized Benchmarking:** Supports 12 jailbreak attacks, allowing for standardized benchmarking and comparison. - **Flexibility and Extensibility:** Modular architecture simplifies the assembly of existing attacks and lowers the development barrier for new attacks. - **Wide Model Compatibility:** Supports a variety of models, including open-source and closed-source models, and integrates with HuggingFace's transformers. **Evaluation:** - **Setup:** Utilized AdvBench dataset and a range of LLMs, including GPT-4, GPT-3.5-Turbo, LLaMA2, Vicuna, Qwen, InterLM, ChatGLM3, and Mistral. - **Attack Recipes:** Deployed several attack recipes for each type of jailbreak method, including human-design, long-tail distribution, and optimization strategies. - **Evaluation:** Used GenerativeJudge as a uniform evaluation method, with GPT-4-turbo-1106 as the scoring model. Results showed a 63% average breach probability across models, highlighting the need for enhanced security measures. **Conclusion:** EasyJailbreak is a significant advancement in securing LLMs against jailbreak attacks. Its unified, modular framework simplifies the evaluation and development of attack and defense strategies, demonstrating compatibility across a spectrum of models. The findings underscore the urgent need for enhanced security protocols to mitigate inherent risks in LLMs.**EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models** This paper introduces EasyJailbreak, a unified framework designed to simplify the construction and evaluation of jailbreak attacks against Large Language Models (LLMs). The framework decomposes jailbreak methods into four components: Selector, Mutator, Constraint, and Evaluator. These components enable researchers to easily build and evaluate attacks by combining novel and existing components. EasyJailbreak supports 11 distinct jailbreak methods and has been validated across 10 LLMs, revealing a significant average breach probability of 60%. Notably, advanced models like GPT-3.5-Turbo and GPT-4 exhibit average Attack Success Rates (ASR) of 57% and 33%, respectively. The framework includes a web platform, PyPI package, screencast video, and experimental outputs, providing researchers with comprehensive resources for security validation. **Key Features:** - **Standardized Benchmarking:** Supports 12 jailbreak attacks, allowing for standardized benchmarking and comparison. - **Flexibility and Extensibility:** Modular architecture simplifies the assembly of existing attacks and lowers the development barrier for new attacks. - **Wide Model Compatibility:** Supports a variety of models, including open-source and closed-source models, and integrates with HuggingFace's transformers. **Evaluation:** - **Setup:** Utilized AdvBench dataset and a range of LLMs, including GPT-4, GPT-3.5-Turbo, LLaMA2, Vicuna, Qwen, InterLM, ChatGLM3, and Mistral. - **Attack Recipes:** Deployed several attack recipes for each type of jailbreak method, including human-design, long-tail distribution, and optimization strategies. - **Evaluation:** Used GenerativeJudge as a uniform evaluation method, with GPT-4-turbo-1106 as the scoring model. Results showed a 63% average breach probability across models, highlighting the need for enhanced security measures. **Conclusion:** EasyJailbreak is a significant advancement in securing LLMs against jailbreak attacks. Its unified, modular framework simplifies the evaluation and development of attack and defense strategies, demonstrating compatibility across a spectrum of models. The findings underscore the urgent need for enhanced security protocols to mitigate inherent risks in LLMs.
Reach us at info@study.space
[slides] EasyJailbreak%3A A Unified Framework for Jailbreaking Large Language Models | StudySpace