Improved Techniques for Optimization-Based Jailbreaking on Large Language Models

5 Jun 2024 | Xiaojun Jia, Tianyu Pang, Chao Du, Yihao Huang, Jindong Gu, Yang Liu, Xiaochun Cao, Min Lin
This paper addresses the challenge of optimization-based jailbreaking of large language models (LLMs). The authors propose several improvements to the Greedy Coordinate Gradient (GCG) attack, a milestone in optimization-based jailbreaking. The key contributions are:

1. **Diverse target templates**: harmful self-suggestion and guidance are introduced into the optimization target to mislead LLMs, making the jailbreak more effective.
2. **Automatic multi-coordinate updating strategy**: instead of replacing a single token per step, the attack adaptively decides how many tokens to replace, accelerating convergence and improving performance (a code sketch follows below).
3. **Easy-to-hard initialization**: jailbreak suffixes are first generated for simple harmful requests and then reused as initialization for more challenging requests, improving efficiency.

The combined method, named $\mathcal{I}$-GCG, is evaluated on several benchmarks, including AdvBench and the NeurIPS 2023 Red Teaming Track. The results show that $\mathcal{I}$-GCG achieves a nearly 100% attack success rate across all evaluated models, outperforming state-of-the-art jailbreak methods. The paper also discusses the impact of the individual techniques, their transferability, and their limitations, and calls for further research on human-preference safeguards and defense approaches for LLMs.
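To make the multi-coordinate updating idea concrete, here is a minimal sketch in plain Python. It is an illustration only, not the authors' code: a toy mismatch-counting loss stands in for the LLM's negative log-likelihood of the target response, candidates are sampled randomly rather than ranked by token gradients as in real GCG, and all names (`loss_fn`, `propose_single_token_candidates`, `multi_coordinate_update`) are assumptions for this example.

```python
# Minimal sketch of I-GCG's automatic multi-coordinate updating, assuming a
# toy loss over integer token IDs instead of a real LLM forward pass.
import random

random.seed(0)
VOCAB_SIZE = 100
SUFFIX_LEN = 20
TARGET = [7] * SUFFIX_LEN  # toy "target": loss counts mismatched positions


def loss_fn(suffix):
    """Stand-in for the negative log-likelihood of the target response."""
    return sum(1 for s, t in zip(suffix, TARGET) if s != t)


def propose_single_token_candidates(suffix, num_candidates=64):
    """Vanilla-GCG-style proposals: each candidate replaces one coordinate.

    Real GCG ranks replacement tokens by gradients; here we sample randomly.
    """
    candidates = []
    for _ in range(num_candidates):
        pos = random.randrange(len(suffix))
        tok = random.randrange(VOCAB_SIZE)
        cand = list(suffix)
        cand[pos] = tok
        candidates.append((pos, cand))
    return candidates


def multi_coordinate_update(suffix):
    """Adaptively apply several of the best single-coordinate updates at once.

    Single-token candidates are scored, the best candidate per position is
    kept, and progressively larger merges of the top-ranked updates are
    tried; the merge with the lowest loss wins, so the number of tokens
    replaced per step is chosen automatically rather than fixed to one.
    """
    scored = sorted(
        ((loss_fn(cand), pos, cand)
         for pos, cand in propose_single_token_candidates(suffix)),
        key=lambda x: x[0],
    )
    best_per_pos = {}  # lowest-loss candidate for each position, in loss order
    for _, pos, cand in scored:
        best_per_pos.setdefault(pos, cand)

    best_suffix, best_loss = list(suffix), loss_fn(suffix)
    merged = list(suffix)
    for pos, cand in best_per_pos.items():  # try merging 1, 2, 3, ... updates
        merged[pos] = cand[pos]
        merged_loss = loss_fn(merged)
        if merged_loss < best_loss:
            best_suffix, best_loss = list(merged), merged_loss
    return best_suffix, best_loss


# Usage: optimize a random suffix with the adaptive update.
suffix = [random.randrange(VOCAB_SIZE) for _ in range(SUFFIX_LEN)]
for step in range(50):
    suffix, loss = multi_coordinate_update(suffix)
    if loss == 0:
        break
```

Under the same assumptions, easy-to-hard initialization would correspond to seeding `suffix` with a suffix already optimized for an easier harmful request, rather than with random tokens.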