Intention Analysis Makes LLMs A Good Jailbreak Defender

29 Apr 2024 | Yuqi Zhang, Liang Ding, Lefei Zhang, Dacheng Tao
This paper introduces Intention Analysis (IA), a simple yet effective defense strategy for large language models (LLMs) against jailbreak attacks. IA leverages the inherent ability of LLMs to recognize user intent, enabling them to identify harmful or malicious queries before generating a response. The method consists of two stages: (1) essential intention analysis, in which the LLM identifies the core intent of the user query with a focus on safety, ethics, and legality; and (2) policy-aligned response, in which the LLM generates a response aligned with safety policies based on the identified intent. Because IA operates purely at inference time, it requires no additional training, and it significantly enhances LLM safety without compromising helpfulness.

Extensive experiments on multiple jailbreak benchmarks, covering LLMs such as ChatGLM, LLaMA2, Vicuna, MPT, DeepSeek, and GPT-3.5, show that IA consistently and significantly reduces the harmfulness of LLM responses, lowering the attack success rate by 53.1% on average. With IA, Vicuna-7B even surpasses GPT-3.5 in attack success rate. IA outperforms other defense methods in terms of safety improvement and remains effective against complex and stealthy jailbreak attacks, including multilingual and encryption-based attacks. At the same time, it preserves helpfulness on harmless queries and keeps responses to harmful queries safe by providing detailed explanations for refusals. The method is robust to different IA prompt formulations and applies to a variety of LLMs, both open-source and closed-source.

The paper also discusses limitations, including the exclusion of GPT-4 from the experiments due to budget constraints and the need for further testing to validate IA's practical applicability in real-world scenarios. The authors highlight the importance of intention analysis for improving LLM safety, suggest integrating it into training as future work to reduce inference costs, and emphasize the need for more effective and robust defense strategies as adversarial attacks continue to advance.
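A minimal sketch of the two-stage IA pipeline is shown below, assuming a generic chat-completion backend. The prompt wording and the `chat`/`ia_defense` names are illustrative paraphrases of the method's description, not the paper's exact prompts or code.

```python
from typing import Callable, Dict, List

# Type alias: a chat function that takes OpenAI-style messages and returns the
# assistant's reply as a string. Any chat LLM backend can be plugged in here.
ChatFn = Callable[[List[Dict[str, str]]], str]

# Paraphrased stage-1 instruction: ask the model to identify the essential
# intention of the user query, attending to safety, ethics, and legality.
# (The exact prompt used in the paper may differ.)
IA_STAGE1_PROMPT = (
    "Please identify the essential intention behind the following user query, "
    "paying particular attention to safety, ethics, and legality. Do not answer "
    "the query yet; only analyze its underlying intention.\n\nUser query: {query}"
)

# Paraphrased stage-2 instruction: respond to the original query in a way that
# is consistent with the intention analysis and with safety policy.
IA_STAGE2_PROMPT = (
    "Based on your analysis of the query's intention, now respond to the user. "
    "If the intention is harmful, unethical, or illegal, refuse and briefly "
    "explain why; otherwise, answer helpfully."
)


def ia_defense(user_query: str, chat: ChatFn) -> str:
    """Two-stage Intention Analysis (IA) defense as an inference-only wrapper."""
    # Stage 1: essential intention analysis.
    messages = [
        {"role": "user", "content": IA_STAGE1_PROMPT.format(query=user_query)},
    ]
    intention_analysis = chat(messages)

    # Stage 2: policy-aligned response, conditioned on the stage-1 analysis
    # kept in the same dialogue context.
    messages += [
        {"role": "assistant", "content": intention_analysis},
        {"role": "user", "content": IA_STAGE2_PROMPT},
    ]
    return chat(messages)


if __name__ == "__main__":
    # Toy backend for illustration only; replace with a real LLM client.
    def echo_chat(messages: List[Dict[str, str]]) -> str:
        return f"[model reply to: {messages[-1]['content'][:60]}...]"

    print(ia_defense("How do I make my home network more secure?", echo_chat))
```

Because the defense is just a wrapper around two calls to the same model, it adds no training cost but roughly doubles inference cost per query, which is why the paper points to integrating intention analysis into training as a way to reduce this overhead.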