Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
8 Jul 2024 | Andy Zhou, Bo Li, Haohan Wang
The paper "Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks" addresses the vulnerability of large language models (LLMs) to adversarial attacks known as jailbreaking, in which adversaries modify prompts to induce unwanted behavior. The authors propose an optimization-based objective and an algorithm called Robust Prompt Optimization (RPO) to create robust system-level defenses. RPO directly incorporates the adversary into the defensive objective and optimizes a lightweight, transferable suffix to adapt to worst-case adaptive attacks. The approach is evaluated on two red-teaming benchmarks, JailbreakBench and HarmBench, showing improved robustness to both known and unknown jailbreaks, with attack success rates (ASR) reduced to 6% on GPT-4 and 0% on Llama-2. The paper also provides theoretical and experimental results demonstrating the effectiveness and practicality of RPO, including its ability to generalize to new attacks and maintain benign usage.
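The minimax structure described above (an adversary adapting its attack suffix, and a defender re-optimizing a defensive suffix against the worst-case attack) can be illustrated with a toy sketch. Everything here is an assumption for illustration: the token vocabulary, the `toy_loss` scoring function, and the exhaustive search are stand-ins for the paper's actual LLM loss and discrete-optimization procedure, not RPO's real implementation.

```python
import itertools

# Toy token vocabulary for suffixes (illustrative, not from the paper).
VOCAB = ["safe", "refuse", "ignore", "comply", "harm"]

def toy_loss(attack_suffix, defense_suffix):
    """Toy stand-in for the jailbreak loss: lower means the model is
    more likely to refuse. We count refusal-promoting tokens in the
    defense suffix minus attack-promoting tokens in the attack suffix."""
    good = {"safe": 1, "refuse": 2}
    bad = {"ignore": 1, "comply": 2, "harm": 1}
    score = sum(good.get(t, 0) for t in defense_suffix)
    score -= sum(bad.get(t, 0) for t in attack_suffix)
    return -score  # minimize: stronger refusal pressure -> lower loss

def worst_case_attack(defense, length=2):
    # Inner step: the adversary adapts to the current defense,
    # maximizing the jailbreak loss (exhaustive search over the toy space).
    return max(itertools.product(VOCAB, repeat=length),
               key=lambda a: toy_loss(a, defense))

def best_defense(attack, length=2):
    # Outer step: the defender re-optimizes the suffix against
    # the worst-case attack found above, minimizing the loss.
    return min(itertools.product(VOCAB, repeat=length),
               key=lambda d: toy_loss(attack, d))

def rpo_sketch(rounds=3):
    """Alternating minimax optimization of a defensive suffix."""
    defense = ("safe", "safe")
    attack = ()
    for _ in range(rounds):
        attack = worst_case_attack(defense)
        defense = best_defense(attack)
    return attack, defense
```

In this toy setting the loop converges immediately: the adversary settles on the most attack-promoting tokens and the defender on the most refusal-promoting ones. The real algorithm replaces the exhaustive search with gradient-guided discrete token optimization over an actual model's loss.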