Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses

3 Jun 2024 | Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, Min Lin
The paper "Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses" by Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, and Min Lin introduces improved few-shot jailbreaking (I-FSJ), a technique for jailbreaking large language models (LLMs). The authors combine three strategies to make few-shot jailbreaking more effective:

1. **Constructing a Demo Pool**: They create a pool of harmful responses generated by "helpful-inclined" models such as Mistral-7B, which are not specifically safety-aligned.
2. **Injecting Special Tokens**: They inject special tokens from the target LLM's chat template, such as [/INST] in Llama-2-7B-Chat, into the generated demos to exploit the model's tendency to generate content when presented with these tokens.
3. **Demo-Level Random Search**: They apply random search at the level of whole demos to optimize the attacking loss, modifying the random search algorithm to be more efficient and robust.

I-FSJ achieves high attack success rates (ASRs) on aligned LLMs, including Llama-2-7B and Llama-3-8B, even when these models are protected by advanced defenses such as perplexity detection and SmoothLLM. In the authors' evaluations, it consistently reaches nearly 100% ASR across various aligned LLMs and advanced defenses. The method is automated, so it does not require extensive human labor, and it serves as a strong baseline for future research on jailbreaking attacks.
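To make the idea of searching at the demo level concrete, here is a minimal sketch of a generic random-search loop over whole items drawn from a pool. It is an illustration under stated assumptions, not the authors' implementation. The function names, the placeholder pool entries, and the rule of accepting a swap only on strict improvement are all hypothetical. The summary does not specify the attacking loss, so `toy_loss` is a synthetic stand-in that has nothing to do with any language model.

```python
import random
from typing import Callable, List, Sequence, Tuple


def demo_level_random_search(
    pool: Sequence[str],
    num_shots: int,
    loss_fn: Callable[[List[str]], float],
    num_steps: int = 200,
    seed: int = 0,
) -> Tuple[List[str], float]:
    """Generic random search that swaps one whole pool item per step.

    Hypothetical sketch: the acceptance rule (keep a swap only if the
    loss strictly decreases) is an assumption, not taken from the paper.
    """
    rng = random.Random(seed)
    # Start from a random selection of `num_shots` items from the pool.
    current = [rng.choice(pool) for _ in range(num_shots)]
    best_loss = loss_fn(current)
    for _ in range(num_steps):
        candidate = list(current)
        # Replace the item at one random position with a random pool item.
        position = rng.randrange(num_shots)
        candidate[position] = rng.choice(pool)
        candidate_loss = loss_fn(candidate)
        if candidate_loss < best_loss:
            current, best_loss = candidate, candidate_loss
    return current, best_loss


def toy_loss(selection: List[str]) -> float:
    """Synthetic placeholder loss: distance of each item's numeric ID from 7."""
    return float(sum(abs(int(item.split("_")[1]) - 7) for item in selection))


if __name__ == "__main__":
    # Placeholder pool entries; they carry no content beyond an ID.
    pool = [f"demo_{i:02d}" for i in range(20)]
    selection, loss = demo_level_random_search(pool, num_shots=4, loss_fn=toy_loss)
    print(selection, loss)
```

Because each step changes a whole item rather than individual tokens, the search space is small and every candidate remains a sequence of intact pool entries. That is one plausible reading of why the summary describes the demo-level variant as efficient.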