BOOSTING JAILBREAK ATTACK WITH MOMENTUM

2 May 2024 | Yihao Zhang, Zeming Wei
The paper "Boosting Jailbreak Attack with Momentum" by Yihao Zhang and Zeming Wei from Peking University addresses the vulnerability of Large Language Models (LLMs) to adversarial attacks, particularly the jailbreak attack. The authors introduce the Momentum Accelerated GCG (MAC) attack, which incorporates a momentum term into the gradient heuristic to optimize adversarial prompts more efficiently. This approach stabilizes the optimization process and enhances the attack's effectiveness and efficiency. The MAC attack is designed to overcome the efficiency bottleneck of the Greedy Coordinate Gradient (GCG) attack, which requires a large number of optimization steps. By adding a momentum term, the MAC attack dynamically adjusts the adversarial suffix after each forward-backward pass, improving stability across different prompts. Experimental results on the vicuna-7b model show that MAC achieves a higher attack success rate (ASR) of 48.6% with only 20 steps, compared to 38.1% with GCG. The paper also discusses the limitations of the current work, such as the batch size being limited to 1 in multiple prompt attacks and the need for further evaluation on more models. The authors acknowledge the importance of developing efficient white-box attacks for developers to evaluate and red-team LLMs effectively. Overall, the MAC attack provides a novel technique to accelerate and enhance the effectiveness of jailbreak attacks on aligned language models, offering deeper insights into the safety evaluations of AI systems.The paper "Boosting Jailbreak Attack with Momentum" by Yihao Zhang and Zeming Wei from Peking University addresses the vulnerability of Large Language Models (LLMs) to adversarial attacks, particularly the jailbreak attack. The authors introduce the Momentum Accelerated GCG (MAC) attack, which incorporates a momentum term into the gradient heuristic to optimize adversarial prompts more efficiently. This approach stabilizes the optimization process and enhances the attack's effectiveness and efficiency. The MAC attack is designed to overcome the efficiency bottleneck of the Greedy Coordinate Gradient (GCG) attack, which requires a large number of optimization steps. By adding a momentum term, the MAC attack dynamically adjusts the adversarial suffix after each forward-backward pass, improving stability across different prompts. Experimental results on the vicuna-7b model show that MAC achieves a higher attack success rate (ASR) of 48.6% with only 20 steps, compared to 38.1% with GCG. The paper also discusses the limitations of the current work, such as the batch size being limited to 1 in multiple prompt attacks and the need for further evaluation on more models. The authors acknowledge the importance of developing efficient white-box attacks for developers to evaluate and red-team LLMs effectively. Overall, the MAC attack provides a novel technique to accelerate and enhance the effectiveness of jailbreak attacks on aligned language models, offering deeper insights into the safety evaluations of AI systems.