30 May 2024 | Chen Xiong, Xiangyu Qi, Pin-Yu Chen, Tsung-Yi Ho
The paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism designed to protect large language models (LLMs) from jailbreak attacks. These attacks aim to circumvent the models' safety and security mechanisms by introducing malicious prompts. Unlike previous approaches that often compromise model utility for safety, DPP aims to achieve a minimal Attack Success Rate (ASR) while preserving high utility. The method uses interpretable suffix prompts to effectively counter a wide range of standard and adaptive jailbreak techniques. Empirical results on LLAMA-2-7B-Chat and Mistral-7B-Instruct-v0.2 models demonstrate the robustness and adaptability of DPP, showing significant reductions in ASR with negligible impact on utility. DPP outperforms existing defense strategies in balancing safety and functionality and provides a scalable and interpretable solution applicable to various LLM platforms. The paper also includes a detailed methodology, experimental setup, and evaluation metrics, highlighting the effectiveness of DPP in defending against jailbreak attacks.The paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism designed to protect large language models (LLMs) from jailbreak attacks. These attacks aim to circumvent the models' safety and security mechanisms by introducing malicious prompts. Unlike previous approaches that often compromise model utility for safety, DPP aims to achieve a minimal Attack Success Rate (ASR) while preserving high utility. The method uses interpretable suffix prompts to effectively counter a wide range of standard and adaptive jailbreak techniques. Empirical results on LLAMA-2-7B-Chat and Mistral-7B-Instruct-v0.2 models demonstrate the robustness and adaptability of DPP, showing significant reductions in ASR with negligible impact on utility. DPP outperforms existing defense strategies in balancing safety and functionality and provides a scalable and interpretable solution applicable to various LLM platforms. The paper also includes a detailed methodology, experimental setup, and evaluation metrics, highlighting the effectiveness of DPP in defending against jailbreak attacks.