2024 | Chawin Sitawarin, Norman Mu, David Wagner, Alexandre Araujo
This paper introduces the Proxy-Guided Attack on Large Language Models (PAL), a black-box attack that elicits harmful responses from large language models (LLMs) at high success rates. PAL is the first optimization-based attack on LLMs in a black-box, query-only setting; it uses a surrogate model to guide the optimization and a loss function tailored to real-world LLM APIs. The attack achieves an 84% attack success rate (ASR) on GPT-3.5-Turbo and 48% on Llama-2-7B, significantly outperforming the prior state of the art. The paper additionally proposes GCG++, an improved version of the GCG attack that reaches 94% ASR on white-box Llama-2-7B, and RAL, a simple yet strong baseline for black-box attacks.
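At a high level, the proxy-guided loop mutates an adversarial suffix, ranks the candidates with a cheap local surrogate, and spends expensive target-API queries only on the most promising few. The sketch below illustrates that idea under assumed interfaces: `surrogate_loss`, `target_loss`, and the random character-level mutation are hypothetical simplifications for illustration, not the authors' implementation (which operates over tokens and proposes candidates using surrogate gradients).

```python
import random
from typing import Callable, List

# Hypothetical interfaces: a cheap local surrogate that scores candidate
# suffixes, and an expensive/rate-limited target API that returns a loss.
SurrogateLoss = Callable[[str], float]
TargetLoss = Callable[[str], float]

def proxy_guided_step(
    suffix: str,
    charset: str,
    surrogate_loss: SurrogateLoss,
    target_loss: TargetLoss,
    n_candidates: int = 64,
    top_k: int = 4,
) -> str:
    """One iteration of proxy-guided search (illustrative sketch only).

    1. Propose random single-character mutations of the current suffix.
    2. Rank all candidates with the cheap local surrogate.
    3. Query the expensive target API only for the top-k survivors.
    """
    candidates: List[str] = []
    for _ in range(n_candidates):
        pos = random.randrange(len(suffix))
        candidates.append(suffix[:pos] + random.choice(charset) + suffix[pos + 1:])

    # Cheap filtering on the surrogate keeps the target query budget small.
    candidates.sort(key=surrogate_loss)
    shortlist = candidates[:top_k]

    # Spend target queries only on the shortlist; keep the best by target loss.
    return min(shortlist + [suffix], key=target_loss)
```

The design point this captures is the query-budget asymmetry: hundreds of candidates are scored locally per step, but only a handful of target-API calls are made.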
The research highlights the vulnerability of LLMs to adversarial attacks, even after safety fine-tuning: carefully crafted prompts can still manipulate these models into generating harmful content. The techniques proposed in this work enable comprehensive safety testing of LLMs and inform the design of stronger guardrails. The code for the proposed attacks is available at https://github.com/chawins/pal.

The paper also examines the challenges of attacking LLMs in a black-box setting, including the limitations of existing APIs and the cost of computing the attack loss, and addresses them with techniques such as logit bias for efficient loss computation and heuristic candidate ranking to reduce query costs. The results show that PAL is highly effective at eliciting harmful responses even when the target string is not matched exactly. The authors further analyze how the choice of target string and format-aware targeting affect attack success rates on jailbreaking tasks. Overall, the research underscores the need for improved safety and security measures to prevent harmful LLM outputs.
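The logit-bias idea can be made concrete. If an API lets the attacker add a bias to a chosen token's logit and returns that token's (biased) log-probability, the unbiased value can be recovered in closed form, because the bias shifts the softmax in a known way. The helper below is a minimal sketch of that recovery; the one-biased-query-per-token protocol and the function names are assumptions for illustration, not the paper's exact procedure.

```python
import math

def unbiased_logprob(observed_logprob: float, bias: float) -> float:
    """Recover a token's unbiased log-probability from one biased query.

    Adding `bias` to token t's logit turns p(t) into
        p'(t) = p(t) * e^bias / (1 + p(t) * (e^bias - 1)),
    which inverts exactly to
        log p(t) = log p'(t) - bias - log(1 - p'(t) * (1 - e^-bias)).
    (Numerically fragile when p'(t) is near 1 and the bias is very large.)
    """
    p_obs = math.exp(observed_logprob)
    return observed_logprob - bias - math.log(1.0 - p_obs * (1.0 - math.exp(-bias)))

def sequence_loss(biased_logprobs: list[float], bias: float) -> float:
    """Negative log-likelihood of a target string, assuming one biased
    API query per target token (a hypothetical protocol for this sketch)."""
    return -sum(unbiased_logprob(lp, bias) for lp in biased_logprobs)
```

As a sanity check, with true p(t) = 0.1 and bias = 2 the API would report log p'(t) ≈ -0.797, and the formula returns ≈ -2.303 = log(0.1).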