3 Jun 2024 | Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, Min Lin
This paper presents an improved few-shot jailbreaking (I-FSJ) method that bypasses aligned large language models (LLMs) and their defenses. The authors propose injecting special system tokens such as [/INST] and running a demo-level random search over a collected demo pool to make jailbreaking more efficient. These techniques achieve high attack success rates (ASRs) on Llama-2-7B and Llama-3-8B, even against advanced defenses such as perplexity detection and SmoothLLM. I-FSJ is fully automated, requires no human labor, and serves as a strong baseline for future research on jailbreaking attacks.
The paper also evaluates I-FSJ against a range of defenses and finds that it remains robust, with high ASRs across different settings. I-FSJ achieves ASRs above 95% on aligned LLMs, even when faced with perturbation-based defenses such as SmoothLLM, and it is effective against strongly aligned models such as Llama-2-7B-Chat and Llama-3-8B-Instruct. The paper highlights significant vulnerabilities in current alignment methods and the need for more resilient alignment strategies in the development of LLMs.
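For readers unfamiliar with the metric, an attack success rate (ASR) is conventionally the percentage of test requests for which an attack is judged to have succeeded. The following is a minimal sketch of that arithmetic only, assuming one boolean verdict per request from some judge. The function name and the sample verdicts are hypothetical, and this summary does not describe the paper's actual judging procedure.

```python
from typing import Sequence


def attack_success_rate(verdicts: Sequence[bool]) -> float:
    """Return the percentage of requests judged as successful attacks.

    `verdicts` holds one boolean per evaluated request (hypothetical input;
    how each verdict is produced is outside the scope of this sketch).
    """
    if not verdicts:
        raise ValueError("verdicts must be non-empty")
    # sum() counts the True entries because bool is a subclass of int.
    return 100.0 * sum(verdicts) / len(verdicts)


# Hypothetical example: 20 requests, 19 judged successful, 1 not.
example_verdicts = [True] * 19 + [False]
print(f"ASR = {attack_success_rate(example_verdicts):.1f}%")
```

Under this definition, an ASR above 95% on a set of 20 requests would mean that every request was judged successful.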