Boosting Jailbreak Attack with Momentum

2 May 2024 | Yihao Zhang, Zeming Wei
This paper introduces the Momentum Accelerated GCG (MAC) attack, a gradient-based method for jailbreaking aligned large language models (LLMs). MAC improves upon the Greedy Coordinate Gradient (GCG) attack by incorporating a momentum term into its gradient heuristic, which stabilizes the optimization process and carries heuristic information forward from previous iterations, yielding faster convergence and higher attack success rates than the original GCG attack.

MAC is evaluated on the Vicuna-7B model, where it shows significant gains in both attack success rate and efficiency. For individual-prompt attacks, MAC achieves a multiple attack success rate (ASR) of 48.6% within only 20 steps, compared to 38.1% for vanilla GCG. For multiple-prompt attacks, MAC generalizes better, attaining a higher maximum ASR with a lower standard deviation, indicating improved efficiency and robustness. The paper also discusses limitations of the current work, including the need to explore other optimization methods and the importance of parameter tuning in adversarial contexts. MAC provides a new technique for accelerating jailbreak attacks on aligned LLMs and offers new insights into the safety evaluation of AI systems. The code for the MAC attack is available at https://github.com/weizeming/momentum-attack-llm.
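The momentum modification described above can be sketched as follows: instead of ranking candidate token substitutions by the raw one-hot gradient at the current step (as GCG does), the attack ranks them by an exponentially accumulated gradient across iterations. This is a minimal illustrative sketch, not the authors' implementation; the function names, array shapes, and the momentum weight `mu` are all assumptions for illustration.

```python
import numpy as np

def momentum_accumulate(grad, momentum_buffer, mu=0.5):
    """Fold the current gradient into a running momentum buffer.

    grad:            (suffix_len, vocab_size) gradient of the adversarial loss
                     w.r.t. the one-hot encoding of the suffix tokens.
    momentum_buffer: running gradient from previous iterations (same shape).
    mu:              momentum weight (hypothetical value; a tunable parameter).
    """
    return mu * momentum_buffer + grad

def top_k_substitutions(momentum_buffer, k=256):
    """For each suffix position, select the k token ids with the most
    negative accumulated gradient (largest predicted loss decrease),
    mirroring GCG's candidate-selection step."""
    return np.argsort(momentum_buffer, axis=1)[:, :k]

# Toy usage: a 3-token suffix over a 10-token vocabulary, 5 iterations.
rng = np.random.default_rng(0)
buf = np.zeros((3, 10))
for _ in range(5):
    grad = rng.normal(size=(3, 10))  # stand-in for a real loss gradient
    buf = momentum_accumulate(grad, buf)
candidates = top_k_substitutions(buf, k=4)
print(candidates.shape)  # (3, 4)
```

In a real attack the gradient would come from backpropagation through the target model's loss on the harmful completion; the sketch only shows how the momentum buffer changes which candidates get ranked highest.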