8 Feb 2024 | Junjie Chu, Yugeng Liu, Ziqing Yang, Xinyue Shen, Michael Backes, Yang Zhang
This paper presents a comprehensive assessment of jailbreak attacks against large language models (LLMs). The authors evaluate 13 state-of-the-art jailbreak methods spanning four categories: human-based, obfuscation-based, optimization-based, and parameter-based. They test 160 forbidden questions drawn from 16 violation categories, defined by a unified policy that consolidates the latest usage policies of five leading LLM-related service providers. The study covers six popular LLMs: the open-source models ChatGLM3, Llama2, and Vicuna, and the closed-source models GPT-3.5, GPT-4, and PaLM2.
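To make the scale of the evaluation grid concrete, here is a minimal sketch of how it could be laid out. The dimension counts come from the summary above; the data layout, the placeholder method names, and the assumption of an even split of questions per category are illustrative, not taken from the authors' code.

```python
# Illustrative layout of the evaluation grid (placeholder identifiers, not the authors' code).

# Four attack categories covering 13 methods in total; the method names below are placeholders.
ATTACK_CATEGORIES = {
    "human-based": ["human_method_1", "human_method_2"],
    "obfuscation-based": ["obfuscation_method_1", "obfuscation_method_2"],
    "optimization-based": ["optimization_method_1", "optimization_method_2"],
    "parameter-based": ["parameter_method_1"],
}

# Six target LLMs: three open-source, three closed-source.
TARGET_MODELS = ["ChatGLM3", "Llama2", "Vicuna", "GPT-3.5", "GPT-4", "PaLM2"]

# 160 forbidden questions spanning 16 violation categories
# (10 per category, assuming an even split).
NUM_VIOLATION_CATEGORIES = 16
TOTAL_QUESTIONS = 160
QUESTIONS_PER_CATEGORY = TOTAL_QUESTIONS // NUM_VIOLATION_CATEGORIES  # 10
```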
The authors find that optimization-based and parameter-based jailbreak attacks achieve relatively high attack success rates (ASRs) across different LLMs; when both attack performance and efficiency are taken into account, parameter-based attacks perform best. Human-based jailbreak attacks, which require no modification of the original question, are also effective in many cases. Obfuscation-based jailbreak attacks are model-specific and rely on the strong capabilities of the target LLM, such as the ability to decode or interpret the obfuscated input.
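As a rough illustration of how an attack success rate might be computed for one attack against one target model, consider the sketch below. The callables `attack`, `query_model`, and `is_jailbroken` are hypothetical stand-ins for the prompt transformation, the target LLM, and the response judge; they do not correspond to the authors' actual implementation.

```python
from typing import Callable, Iterable

def attack_success_rate(
    questions: Iterable[str],
    attack: Callable[[str], str],               # wraps a forbidden question in a jailbreak prompt
    query_model: Callable[[str], str],          # sends the final prompt to the target LLM
    is_jailbroken: Callable[[str, str], bool],  # judges whether the response answers the forbidden question
) -> float:
    """Fraction of forbidden questions for which the attack elicits a policy-violating answer."""
    questions = list(questions)
    successes = sum(is_jailbroken(q, query_model(attack(q))) for q in questions)
    return successes / len(questions)
```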
The study also compares the jailbreak methods along three dimensions: attack performance, efficiency, and transferability. The results show that optimization-based and parameter-based attacks are the most robust and versatile across violation categories. Additionally, jailbreak prompts remain transferable across models, making these attacks a practical option against black-box models.
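The transferability result could be probed with a sketch like the following, which replays adversarial prompts crafted on one (white-box) source model against every other model and records the cross-model success rate. All names here are hypothetical, and the paper's actual evaluation pipeline may differ.

```python
from typing import Callable

def transfer_matrix(
    prompts_by_source: dict[str, list[str]],         # adversarial prompts crafted on each source model
    target_models: dict[str, Callable[[str], str]],  # target model name -> query function
    is_jailbroken: Callable[[str], bool],            # hypothetical judge over the target's response
) -> dict[tuple[str, str], float]:
    """Replay prompts crafted on one model against the others; record the success rate per pair."""
    results: dict[tuple[str, str], float] = {}
    for source, prompts in prompts_by_source.items():
        for target, query in target_models.items():
            if source == target:
                continue  # skip the model the prompts were optimized on
            hits = sum(is_jailbroken(query(p)) for p in prompts)
            results[(source, target)] = hits / len(prompts)
    return results
```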
The authors also discuss the trade-off between attack performance and efficiency, as well as the transferability of jailbreak prompts. They find that attack success rates remain high across all violation categories, including those explicitly prohibited by model providers, underscoring the difficulty of aligning LLMs with their stated usage policies and of defending against jailbreak attacks.
The study highlights the necessity of systematically evaluating jailbreak methods, offers insights for future research on jailbreak attacks, and gives researchers and practitioners a benchmark for evaluating these methods. The authors emphasize the importance of collecting and analyzing jailbreak prompts to improve the security of LLMs.