EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models

EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models

18 Mar 2024 | Weikang Zhou, Xiao Wang, Limiao Xiong, Han Xia, Yingshuang Gu, Mingxu Chai, Fukang Zhu, Caishuang Huang, Shihan Dou, Zhiheng Xi, Rui Zheng, Songyang Gao, Yicheng Zou, Hang Yan, Yifan Le, Ruohui Wang, Lijun Li, Jing Shao, Tao Gui, Qi Zhang, Xuanjing Huang
EasyJailbreak is a unified framework for jailbreaking large language models (LLMs), designed to simplify the construction and evaluation of jailbreak attacks. It consists of four core components: Selector, Mutator, Constraint, and Evaluator. The Selector identifies the most effective jailbreak inputs, the Mutator refines prompts to bypass safeguards, the Constraint filters out ineffective inputs, and the Evaluator assesses the success of each attack. EasyJailbreak supports 11 jailbreak methods and is compatible with various models, including open-source and closed-source ones like LLaMA2 and GPT-4. Evaluations on 10 LLMs revealed an average breach probability of 60%, with advanced models like GPT-3.5-Turbo and GPT-4 having ASRs of 57% and 33%, respectively. The framework provides resources for researchers, including a web platform, PyPI package, and experimental outputs. It also enables comprehensive security evaluations and highlights the need for improved security measures in LLMs. The framework's modular design allows for flexibility and extensibility, enabling researchers to focus on creating unique components while leveraging the framework for efficient development. EasyJailbreak's evaluation of different jailbreak methods on various LLMs shows that increasing model size does not necessarily improve security. The framework also includes evaluators that balance accuracy and efficiency, with GPT-4 leading in accuracy but having longer processing times, while the Gptfuzz classifier offers high efficiency and accuracy. The study underscores the urgent need for enhanced security protocols to mitigate inherent risks in LLMs.EasyJailbreak is a unified framework for jailbreaking large language models (LLMs), designed to simplify the construction and evaluation of jailbreak attacks. It consists of four core components: Selector, Mutator, Constraint, and Evaluator. The Selector identifies the most effective jailbreak inputs, the Mutator refines prompts to bypass safeguards, the Constraint filters out ineffective inputs, and the Evaluator assesses the success of each attack. EasyJailbreak supports 11 jailbreak methods and is compatible with various models, including open-source and closed-source ones like LLaMA2 and GPT-4. Evaluations on 10 LLMs revealed an average breach probability of 60%, with advanced models like GPT-3.5-Turbo and GPT-4 having ASRs of 57% and 33%, respectively. The framework provides resources for researchers, including a web platform, PyPI package, and experimental outputs. It also enables comprehensive security evaluations and highlights the need for improved security measures in LLMs. The framework's modular design allows for flexibility and extensibility, enabling researchers to focus on creating unique components while leveraging the framework for efficient development. EasyJailbreak's evaluation of different jailbreak methods on various LLMs shows that increasing model size does not necessarily improve security. The framework also includes evaluators that balance accuracy and efficiency, with GPT-4 leading in accuracy but having longer processing times, while the Gptfuzz classifier offers high efficiency and accuracy. The study underscores the urgent need for enhanced security protocols to mitigate inherent risks in LLMs.
Reach us at info@study.space