Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks

30 May 2024 | Chen Xiong, Xiangyu Qi, Pin-Yu Chen, Tsung-Yi Ho
The paper introduces Defensive Prompt Patch (DPP), a prompt-based defense mechanism that protects large language models (LLMs) against jailbreak attacks. DPP is designed to minimize the Attack Success Rate (ASR) while preserving high utility, outperforming existing defense strategies in balancing safety and functionality. The method appends interpretable suffix prompts that counter a wide range of standard and adaptive jailbreak techniques.

DPP employs a Hierarchical Genetic Algorithm to iteratively refine its suffix prompts. On LLAMA-2-7B-Chat and Mistral-7B-Instruct-v0.2, this yields low ASRs of 3.8% and 2.0%, respectively, with minimal impact on utility. The defense also generalizes across different LLMs and unforeseen jailbreak queries, and remains robust against adaptive jailbreak attacks.

DPP is evaluated against multiple jailbreak attacks, including GCG, AutoDAN, PAIR, TAP, ICA, and Catastrophic, consistently achieving the lowest ASR and highest utility among the compared defenses. Analysis of its suffix prompts shows greater clarity than existing prompt-based defenses, validating its interpretability. Ablation studies further show that DPP performs better as a suffix than as a prefix and maintains strong defense performance even on less-aligned models. Overall, DPP is a scalable, interpretable, and effective defense mechanism for LLMs against jailbreak attacks.
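To make the optimization idea concrete, the loop below is a minimal, illustrative sketch of refining a defensive suffix with a genetic algorithm. It is a simplification of the paper's Hierarchical Genetic Algorithm (which operates at both sentence and word level and scores candidates with the target LLM); the `mock_defense_score` and `mock_utility_score` functions are hypothetical stand-ins for the real objectives (low ASR on jailbreak queries, preserved quality on benign queries).

```python
import random

# Hypothetical word pool; the real method searches over natural-language
# suffix candidates scored by the target LLM, not a fixed vocabulary.
WORD_POOL = ["Please", "refuse", "harmful", "requests", "politely",
             "remain", "helpful", "safe", "ethical", "always"]

def mock_defense_score(suffix):
    # Stand-in for (1 - ASR): reward safety-related vocabulary.
    return sum(w in ("refuse", "harmful", "safe", "ethical") for w in suffix)

def mock_utility_score(suffix):
    # Stand-in for benign-task utility: reward helpful wording, penalize length.
    return sum(w in ("helpful", "politely") for w in suffix) - 0.1 * len(suffix)

def fitness(suffix, alpha=1.0, beta=0.5):
    # DPP's core trade-off: weigh defense against utility.
    return alpha * mock_defense_score(suffix) + beta * mock_utility_score(suffix)

def mutate(suffix, rate=0.3):
    return [random.choice(WORD_POOL) if random.random() < rate else w
            for w in suffix]

def crossover(a, b):
    cut = random.randint(1, min(len(a), len(b)) - 1)
    return a[:cut] + b[cut:]

def optimize(pop_size=20, suffix_len=6, generations=40, seed=0):
    random.seed(seed)
    pop = [[random.choice(WORD_POOL) for _ in range(suffix_len)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elite = pop[: pop_size // 2]  # keep the fittest half unchanged
        children = [mutate(crossover(random.choice(elite), random.choice(elite)))
                    for _ in range(pop_size - len(elite))]
        pop = elite + children
    return max(pop, key=fitness)

best = optimize()
print(" ".join(best))  # optimized suffix appended after each user query
```

In deployment, the resulting suffix is simply concatenated after every user query, which is what makes the defense cheap and portable across LLM platforms: no fine-tuning of the model is required.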