5 Jun 2024 | Xiaojun Jia, Tianyu Pang, Chao Du, Yihao Huang, Jindong Gu, Yang Liu, Xiaochun Cao, Min Lin
This paper presents improved techniques for optimization-based jailbreaking of large language models (LLMs). The authors propose I-GCG, an enhanced version of the Greedy Coordinate Gradient (GCG) attack that improves both the efficiency and the effectiveness of jailbreaking. The key improvements are: diverse target templates containing harmful self-suggestion and guidance to mislead LLMs; an automatic multi-coordinate updating strategy to accelerate convergence; and an easy-to-hard initialization strategy to improve jailbreak efficiency. Combined, these techniques yield I-GCG, which achieves a nearly 100% attack success rate across various LLMs. Experiments show that I-GCG outperforms state-of-the-art jailbreaking attacks, demonstrating its effectiveness at bypassing the safety safeguards of LLMs. The results underscore the importance of improving jailbreak techniques to better understand and counteract the vulnerabilities of large language models.
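To make the multi-coordinate updating idea concrete, the toy sketch below greedily substitutes several suffix tokens per optimization step instead of one, as vanilla GCG does. It is a minimal sketch only: the real attack optimizes an adversarial suffix against an LLM's cross-entropy loss on the harmful target string, whereas here `loss_fn`, `VOCAB_SIZE`, `SUFFIX_LEN`, `TOP_K`, `N_COORDS`, and the fixed `target` are stand-in assumptions, not the authors' implementation.

```python
# Toy sketch of GCG-style greedy coordinate descent with a multi-coordinate
# update. A real attack would replace loss_fn with the LLM's cross-entropy
# on the harmful target string; all constants here are illustrative.
import torch
import torch.nn.functional as F

VOCAB_SIZE = 100   # toy vocabulary size (assumption)
SUFFIX_LEN = 10    # adversarial suffix length (assumption)
TOP_K = 8          # gradient-ranked candidate tokens per position (assumption)
N_COORDS = 3       # substitutions accepted per step; vanilla GCG uses 1

torch.manual_seed(0)
target = torch.randint(0, VOCAB_SIZE, (SUFFIX_LEN,))  # stand-in optimum

def loss_fn(one_hot: torch.Tensor) -> torch.Tensor:
    """Stand-in for the LLM's target loss, differentiable in the one-hot input."""
    return F.cross_entropy(one_hot * 10.0, target)

def eval_loss(tokens: torch.Tensor) -> float:
    return loss_fn(F.one_hot(tokens, VOCAB_SIZE).float()).item()

suffix = torch.randint(0, VOCAB_SIZE, (SUFFIX_LEN,))

for step in range(100):
    # Gradient of the loss w.r.t. the suffix's one-hot encoding (as in GCG).
    one_hot = F.one_hot(suffix, VOCAB_SIZE).float().requires_grad_(True)
    loss_fn(one_hot).backward()
    # The most negative gradient entries suggest promising token substitutions.
    candidates = (-one_hot.grad).topk(TOP_K, dim=1).indices  # (SUFFIX_LEN, TOP_K)

    # Score the best single-token substitution at every position...
    best_per_pos = []
    for pos in range(SUFFIX_LEN):
        trials = []
        for tok in candidates[pos].tolist():
            trial = suffix.clone()
            trial[pos] = tok
            trials.append((eval_loss(trial), pos, tok))
        best_per_pos.append(min(trials))
    # ...then accept the N_COORDS substitutions with the lowest trial losses.
    # (The paper's "automatic" strategy chooses this count adaptively; a fixed
    # N_COORDS is a simplification.)
    for _, pos, tok in sorted(best_per_pos)[:N_COORDS]:
        suffix[pos] = tok

    if torch.equal(suffix, target):  # toy convergence check
        print(f"converged at step {step}, loss {eval_loss(suffix):.4f}")
        break
```

Because several low-loss substitutions are committed per step rather than one, this variant needs far fewer iterations than single-coordinate GCG on this toy objective, which is the intuition behind the paper's reported convergence speedup.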