Attacking Large Language Models with Projected Gradient Descent


2024 | Simon Geisler, Tom Wollschläger, M. H. I. Abdalla, Johannes Gasteiger, Stephan Günnemann
This paper presents a novel approach to attacking large language models (LLMs) using Projected Gradient Descent (PGD) on continuously relaxed input prompts. The authors show that, by carefully controlling the error introduced by the continuous relaxation, PGD matches the effectiveness of discrete optimization methods at significantly lower computational cost: it is up to one order of magnitude faster than state-of-the-art discrete optimization techniques while achieving the same devastating attack results. This efficiency makes it attractive for large-scale evaluation and adversarial training.

The method relaxes the one-hot encoding of tokens, enabling gradient-based optimization over a sequence of simplices, one T-dimensional simplex per position of the length-L prompt. The projection back onto the simplex naturally yields sparse solutions, which helps the method recover discrete prompts efficiently. To control the error introduced by the relaxation, the authors add an entropy projection that uses the Gini index as the entropy measure. They also support flexible sequence lengths: a mask controlling how tokens enter the attention operation allows tokens to be inserted or removed smoothly.

The implementation combines the simplex and entropy projections and is tested on several LLMs, including Vicuna 1.3 7B, Falcon 7B, Falcon 7B Instruct, Llama3, and Gemma 2B and 7B. In the experiments, PGD outperforms gradient-based attacks such as GBDA and discrete optimization methods such as GCG in both effectiveness and efficiency. It achieves high success rates on "behavior" jailbreaking tasks, where the goal is to make the LLM respond in a way that violates its alignment with the system prompt, and is also effective on an obedience task, where the goal is to make the LLM avoid using a specific word.
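As a rough illustration of the core machinery, here is a minimal NumPy sketch of a projection onto the probability simplex and a simplified Gini-based entropy projection of the kind the summary describes. All function names are my own, and the entropy projection is a simplified variant, not the paper's exact procedure; it relies on the identity that, on the simplex, a Gini index of at most c is equivalent to a minimum Euclidean distance from the uniform distribution.

```python
import numpy as np

def project_simplex(x):
    """Euclidean projection of a vector onto the probability simplex
    (the classic sort-and-threshold algorithm)."""
    u = np.sort(x)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, x.size + 1)
    rho = np.nonzero(u * idx > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(x - theta, 0.0)

def gini(p):
    """Gini index, the entropy measure named in the summary: 1 - sum p_i^2."""
    return 1.0 - np.sum(p ** 2)

def entropy_project(p, c, eps=1e-12):
    """Simplified entropy projection: if gini(p) > c, push p radially away
    from the uniform distribution, then re-project onto the simplex.

    On the simplex, gini(p) <= c is equivalent to
    ||p - u||_2 >= sqrt(1 - c - 1/T), where u is uniform over T tokens.
    """
    T = p.size
    u = np.full(T, 1.0 / T)
    r_target = np.sqrt(max(1.0 - c - 1.0 / T, 0.0))
    d = p - u
    r = np.linalg.norm(d)
    if r + eps < r_target:
        p = u + d * (r_target / (r + eps))
    # Re-projection can leave the Gini index slightly above c; in practice
    # the two projections would be alternated.
    return project_simplex(p)

def pgd_step(p, grad, lr, c):
    """One PGD step on a relaxed one-hot row: gradient step, then both
    projections (a sketch, not the paper's exact update)."""
    return entropy_project(project_simplex(p - lr * grad), c)
```

The sparsity-inducing effect mentioned in the summary is visible here: the thresholding in `project_simplex` zeroes out coordinates, so iterates tend back toward near-one-hot (i.e., discrete) token distributions.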
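The flexible sequence-length idea, a mask gating how tokens enter the attention operation, can be sketched as adding log-mask values to the attention logits, so that a mask entry near 0 smoothly removes a token while 1 keeps it fully present. This is an illustrative reconstruction under my own naming, not the paper's exact implementation:

```python
import numpy as np

def masked_attention(q, k, v, soft_mask, eps=1e-9):
    """Single-head attention with a soft token-presence mask in [0, 1].

    Adding log(soft_mask) to the logits interpolates smoothly between a
    key/value token being fully attended (mask 1) and removed (mask 0),
    which lets a gradient-based attack grow or shrink the prompt.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])          # (Lq, Lk) logits
    scores = scores + np.log(soft_mask + eps)[None, :]
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)            # masked softmax
    return w @ v
```

Because the mask is continuous, it can be optimized jointly with the relaxed token distributions and rounded to {0, 1} at the end to obtain a concrete prompt length.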
The paper also discusses the importance of efficient adversarial attacks for evaluating and improving the alignment of LLMs. The authors highlight the trade-off between computational cost and effectiveness in automatic red teaming, and note that continuous relaxations had previously been a practical choice mainly for encoder-only LLMs; the entropy projection is their novel strategy for counteracting the introduced relaxation error. The paper concludes that PGD, the default choice for generating adversarial perturbations in other domains, can also be very effective and efficient for LLMs, reaching the same attack strength as GCG up to one order of magnitude faster. This stands in contrast to earlier ordinary gradient-based optimization methods such as GBDA, which are virtually unable to fool aligned LLMs. Under more advanced measures of attack success rate, however, GCG remains slightly superior in some cases, and the authors suggest that future work should improve the consistency between the optimization objective and the measurement of real attack success.