Comprehensive Assessment of Jailbreak Attacks Against LLMs

Comprehensive Assessment of Jailbreak Attacks Against LLMs

8 Feb 2024 | Junjie Chu, Yugeng Liu, Ziqing Yang, Xinyue Shen, Michael Backes, Yang Zhang
This article presents a comprehensive assessment of jailbreak attacks against large language models (LLMs). The study evaluates 13 cutting-edge jailbreak methods across four categories: human-based, obfuscation-based, optimization-based, and parameter-based. The researchers analyze the effectiveness of these methods on six popular LLMs, including open-source models like ChatGLM3, Llama2, and Vicuna, as well as closed-source models like GPT-3.5, GPT-4, and PaLM2. The study also constructs a forbidden question dataset based on 16 violation categories derived from the policies of major LLM providers. The results show that optimization-based and parameter-based jailbreak methods achieve the highest attack success rates (ASR), while human-based methods also demonstrate strong effectiveness, particularly in black-box scenarios. The study highlights the challenges of aligning LLM policies with safety mechanisms and the need for robust countermeasures against jailbreak attacks. It also discusses the trade-off between attack performance and efficiency, as well as the transferability of jailbreak prompts across different LLMs. The research provides a benchmark for evaluating jailbreak methods and offers insights for future research on this topic. The findings emphasize the importance of systematically assessing jailbreak techniques to improve the security and ethical alignment of LLMs.This article presents a comprehensive assessment of jailbreak attacks against large language models (LLMs). The study evaluates 13 cutting-edge jailbreak methods across four categories: human-based, obfuscation-based, optimization-based, and parameter-based. The researchers analyze the effectiveness of these methods on six popular LLMs, including open-source models like ChatGLM3, Llama2, and Vicuna, as well as closed-source models like GPT-3.5, GPT-4, and PaLM2. The study also constructs a forbidden question dataset based on 16 violation categories derived from the policies of major LLM providers. The results show that optimization-based and parameter-based jailbreak methods achieve the highest attack success rates (ASR), while human-based methods also demonstrate strong effectiveness, particularly in black-box scenarios. The study highlights the challenges of aligning LLM policies with safety mechanisms and the need for robust countermeasures against jailbreak attacks. It also discusses the trade-off between attack performance and efficiency, as well as the transferability of jailbreak prompts across different LLMs. The research provides a benchmark for evaluating jailbreak methods and offers insights for future research on this topic. The findings emphasize the importance of systematically assessing jailbreak techniques to improve the security and ethical alignment of LLMs.
Reach us at info@study.space