Attacking Large Language Models with Projected Gradient Descent


2024 | Simon Geisler, Tom Wollschläger, M. H. I. Abdalla, Johannes Gasteiger, Stephan Günnemann
This paper presents a novel approach to attacking large language models (LLMs) using Projected Gradient Descent (PGD) on continuously relaxed input prompts. The authors show that, by carefully controlling the error introduced by the continuous relaxation, PGD matches the effectiveness of discrete optimization methods at significantly lower computational cost: it is up to one order of magnitude faster than state-of-the-art discrete optimization techniques while achieving the same devastating attack results. This efficiency makes it attractive for large-scale evaluation and adversarial training.

The method relaxes the one-hot encoding of tokens, enabling gradient-based optimization over a sequence of simplices, one T-dimensional simplex per position of the length-L prompt. The projection back onto the simplex naturally yields sparse solutions, which helps the method recover discrete prompts efficiently. To control the error introduced by the relaxation, the authors add an entropy projection that uses the Gini index as the entropy measure. They also support flexible sequence lengths: a mask controlling how tokens enter the attention operation allows tokens to be inserted or removed smoothly.

The implementation combines the simplex and entropy projections and is tested on several LLMs, including Vicuna 1.3 7B, Falcon 7B, Falcon 7B Instruct, Llama3, and Gemma 2B and 7B. In the experiments, PGD outperforms gradient-based attacks such as GBDA and discrete optimization methods such as GCG in both effectiveness and efficiency. It achieves high success rates on "behavior" jailbreaking tasks, where the goal is to make the LLM respond in a way that violates its alignment with the system prompt, and is also effective on an obedience task, where the goal is to make the LLM avoid using a specific word.
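As a rough illustration of the core machinery, here is a minimal NumPy sketch of a projection onto the probability simplex and a simplified Gini-based entropy projection of the kind the summary describes. All function names are my own, and the entropy projection is a simplified variant, not the paper's exact procedure; it relies on the identity that, on the simplex, a Gini index of at most c is equivalent to a minimum Euclidean distance from the uniform distribution.

```python
import numpy as np

def project_simplex(x):
    """Euclidean projection of a vector onto the probability simplex
    (the classic sort-and-threshold algorithm)."""
    u = np.sort(x)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, x.size + 1)
    rho = np.nonzero(u * idx > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(x - theta, 0.0)

def gini(p):
    """Gini index, the entropy measure named in the summary: 1 - sum p_i^2."""
    return 1.0 - np.sum(p ** 2)

def entropy_project(p, c, eps=1e-12):
    """Simplified entropy projection: if gini(p) > c, push p radially away
    from the uniform distribution, then re-project onto the simplex.

    On the simplex, gini(p) <= c is equivalent to
    ||p - u||_2 >= sqrt(1 - c - 1/T), where u is uniform over T tokens.
    """
    T = p.size
    u = np.full(T, 1.0 / T)
    r_target = np.sqrt(max(1.0 - c - 1.0 / T, 0.0))
    d = p - u
    r = np.linalg.norm(d)
    if r + eps < r_target:
        p = u + d * (r_target / (r + eps))
    # Re-projection can leave the Gini index slightly above c; in practice
    # the two projections would be alternated.
    return project_simplex(p)

def pgd_step(p, grad, lr, c):
    """One PGD step on a relaxed one-hot row: gradient step, then both
    projections (a sketch, not the paper's exact update)."""
    return entropy_project(project_simplex(p - lr * grad), c)
```

The sparsity-inducing effect mentioned in the summary is visible here: the thresholding in `project_simplex` zeroes out coordinates, so iterates tend back toward near-one-hot (i.e., discrete) token distributions.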
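The flexible sequence-length idea, a mask gating how tokens enter the attention operation, can be sketched as adding log-mask values to the attention logits, so that a mask entry near 0 smoothly removes a token while 1 keeps it fully present. This is an illustrative reconstruction under my own naming, not the paper's exact implementation:

```python
import numpy as np

def masked_attention(q, k, v, soft_mask, eps=1e-9):
    """Single-head attention with a soft token-presence mask in [0, 1].

    Adding log(soft_mask) to the logits interpolates smoothly between a
    key/value token being fully attended (mask 1) and removed (mask 0),
    which lets a gradient-based attack grow or shrink the prompt.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])          # (Lq, Lk) logits
    scores = scores + np.log(soft_mask + eps)[None, :]
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)            # masked softmax
    return w @ v
```

Because the mask is continuous, it can be optimized jointly with the relaxed token distributions and rounded to {0, 1} at the end to obtain a concrete prompt length.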
The paper also discusses the importance of efficient adversarial attacks for evaluating and improving the alignment of LLMs. The authors highlight the trade-off between computational cost and effectiveness in automatic red teaming, and note that continuous relaxations had previously been a practical choice mainly for encoder-only LLMs; the entropy projection is their novel strategy for counteracting the introduced relaxation error. The paper concludes that PGD, the default choice for generating adversarial perturbations in other domains, can also be very effective and efficient for LLMs, reaching the same attack strength as GCG up to one order of magnitude faster. This stands in contrast to earlier ordinary gradient-based optimization methods such as GBDA, which are virtually unable to fool aligned LLMs. Under more advanced measures of attack success rate, however, GCG remains slightly superior in some cases, and the authors suggest that future work should improve the consistency between the optimization objective and the measurement of real attack success.