21 Aug 2024 | Yichuan Mo, Yuji Wang, Zeming Wei, Yisen Wang
This paper addresses jailbreak attacks on Large Language Models (LLMs), which can bypass alignment safeguards and elicit harmful content. The authors propose Prompt Adversarial Tuning (PAT), which trains a defensive control prompt attached to user prompts to improve robustness against such attacks. PAT optimizes the control prompt on both adversarial and benign prompts, preserving natural performance while reducing the success rate of advanced attacks to nearly 0%. The method is effective in both grey-box and black-box settings and incurs minimal computational overhead. Experiments on both open-source and closed-source LLMs demonstrate the effectiveness and transferability of PAT. The authors also discuss the limitations and broader impacts of their work, highlighting its potential to help build more reliable and trustworthy LLMs.
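To make the dual objective concrete, below is a minimal sketch of a PAT-style loss that balances robustness and utility. It is an illustration under stated assumptions, not the authors' implementation: the helper `log_prob`, the names `defense`, `adv_prompt`, `refusal`, `benign_prompt`, `benign_answer`, and the trade-off weight `alpha` are all hypothetical.

```python
def pat_loss(model, defense, adv_prompt, refusal,
             benign_prompt, benign_answer, log_prob, alpha=0.5):
    """Sketch of a combined PAT-style objective.

    Assumes `log_prob(model, prompt, target)` returns the model's
    log-likelihood of `target` given `prompt` (hypothetical helper).
    `defense` is the trainable control prompt attached to every user prompt.
    """
    # Robustness term: on an adversarial prompt (harmful request plus attack
    # suffix), encourage the model to produce a refusal.
    adv_term = -log_prob(model, defense + adv_prompt, refusal)

    # Utility term: on a benign prompt, preserve the normal helpful answer
    # so natural performance is maintained.
    benign_term = -log_prob(model, defense + benign_prompt, benign_answer)

    # `alpha` is an assumed weight trading off robustness against utility.
    return alpha * adv_term + (1 - alpha) * benign_term
```

Note that the sketch only shows the objective; in the paper the defense control consists of discrete tokens, so minimizing such a loss would be done with a token-level search procedure rather than ordinary gradient descent.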