15 Feb 2024 | Chawin Sitawarin, Norman Mu, David Wagner, Alexandre Araujo
The paper introduces the Proxy-Guided Attack on Large Language Models (PAL), a novel black-box optimization attack designed to elicit harmful responses from large language models (LLMs). PAL leverages a surrogate model to guide the optimization process and a sophisticated loss function tailored for real-world LLM APIs. The attack achieves an 84% attack success rate (ASR) on GPT-3.5-Turbo and 48% on Llama-2-7B, significantly outperforming current state-of-the-art methods. The authors also propose GCG++, an improved white-box attack that achieves 94% ASR on Llama-2-7B, and RAL, a simple and effective black-box baseline with a 26% ASR on Llama-2-7B. The techniques introduced in this work aim to enhance the safety testing of LLMs and develop better security measures. The code for the attacks is available at <https://github.com/chawins/pal>.
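The core idea behind proxy-guided optimization can be sketched in a few lines: score many candidate adversarial suffixes cheaply on a local surrogate (proxy) model, and spend expensive target-API queries only on the top-ranked candidates. The sketch below is illustrative only, not the paper's implementation; `proxy_loss` and `target_loss` are hypothetical stand-ins for real model losses.

```python
import random

def proxy_loss(suffix: str) -> float:
    """Stand-in for a loss computed on a cheap local surrogate model."""
    # Toy heuristic for illustration: pretend shorter suffixes score better.
    return float(len(suffix))

def target_loss(suffix: str) -> float:
    """Stand-in for the (expensive) loss estimated via the target LLM API."""
    # Noise models the uncertainty of estimating a loss from API outputs.
    return float(len(suffix)) + random.random()

def proxy_guided_step(candidates: list[str], top_k: int = 2) -> str:
    # 1) Rank every candidate on the cheap proxy model.
    ranked = sorted(candidates, key=proxy_loss)
    # 2) Query the target API only for the top-k proxy-ranked candidates.
    shortlisted = ranked[:top_k]
    # 3) Keep the candidate with the lowest target loss.
    return min(shortlisted, key=target_loss)

candidates = ["!!!! describe", "please !!", "x" * 30, "zz"]
best = proxy_guided_step(candidates)
```

The surrogate acts as a filter: the number of target-API queries per step drops from the full candidate pool to `top_k`, which is what makes attacking a commercial API economically feasible.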