Jailbreak Attacks and Defenses Against Large Language Models: A Survey

Jailbreak Attacks and Defenses Against Large Language Models: A Survey

5 Jul 2024 | Sibo Yi, Yule Liu, Zhen Sun, Tianshuo Cong, Xinlei He, Jiaxing Song, Ke Xu, Qi Li
This paper presents a comprehensive survey of jailbreak attacks and defenses against Large Language Models (LLMs). It categorizes jailbreak attacks into white-box and black-box attacks based on the transparency of the target model, and defense methods into prompt-level and model-level defenses. The paper also explores various sub-classes of these attacks and defenses, and provides a coherent diagram illustrating their relationships. It investigates current evaluation methods and compares them from different perspectives. The findings aim to inspire future research and practical implementations in safeguarding LLMs against adversarial attacks. LLMs have shown exceptional performance in text-generative tasks, but their over-assistance has raised the challenge of "jailbreaking," where malicious actors exploit vulnerabilities in the model's architecture or implementation to elicit harmful behaviors. Jailbreak attacks represent a unique and evolving threat landscape that demands careful examination and mitigation strategies. The paper discusses various attack vectors, techniques, and case studies to elucidate the underlying vulnerabilities and potential impact on model security and integrity. It also discusses existing countermeasures and strategies for mitigating the risks associated with jailbreak attacks. The paper provides a systematic taxonomy of both jailbreak attack and defense methods. According to the transparency level of the target LLM to attackers, attack methods are categorized into white-box and black-box attacks, and further divided into sub-classes for further investigation. Similarly, defense methods are categorized into prompt-level and model-level defenses, which implies whether the safety measure modifies the protected LLM or not. The detailed definitions of the methods are listed in Table 1. The paper highlights the relationships between different attack and defense methods. Although a certain defense method is designed to counter a specific attack method, it sometimes proves effective against other attack methods as well. The relationships are illustrated in Figure 1, which have been proven by experiments in other research. The paper also conducts an investigation into current evaluation methods. It briefly introduces the popular metric in jailbreak research and summarizes current benchmarks including some frameworks and datasets. The paper discusses various types of jailbreak attacks, including gradient-based, logits-based, and fine-tuning-based attacks. It also discusses black-box attacks, including template completion, prompt rewriting, and LLM-based generation. The paper highlights the effectiveness of these attacks and the corresponding defense methods. The paper also discusses various defense methods, including prompt-level and model-level defenses. Prompt-level defenses directly probe the input prompts and eliminate the malicious content before they are fed into the language model for generation. Model-level defenses leave the prompts unchanged and fine-tune the language model to enhance the intrinsic safety guardrails so that the models decline to answer the harmful requests. The paper concludes that jailbreak attacks remain a significant concern within the community, but the work enhances the understanding of this domain and provides a foundation for developing more secure LLMs.This paper presents a comprehensive survey of jailbreak attacks and defenses against Large Language Models (LLMs). It categorizes jailbreak attacks into white-box and black-box attacks based on the transparency of the target model, and defense methods into prompt-level and model-level defenses. The paper also explores various sub-classes of these attacks and defenses, and provides a coherent diagram illustrating their relationships. It investigates current evaluation methods and compares them from different perspectives. The findings aim to inspire future research and practical implementations in safeguarding LLMs against adversarial attacks. LLMs have shown exceptional performance in text-generative tasks, but their over-assistance has raised the challenge of "jailbreaking," where malicious actors exploit vulnerabilities in the model's architecture or implementation to elicit harmful behaviors. Jailbreak attacks represent a unique and evolving threat landscape that demands careful examination and mitigation strategies. The paper discusses various attack vectors, techniques, and case studies to elucidate the underlying vulnerabilities and potential impact on model security and integrity. It also discusses existing countermeasures and strategies for mitigating the risks associated with jailbreak attacks. The paper provides a systematic taxonomy of both jailbreak attack and defense methods. According to the transparency level of the target LLM to attackers, attack methods are categorized into white-box and black-box attacks, and further divided into sub-classes for further investigation. Similarly, defense methods are categorized into prompt-level and model-level defenses, which implies whether the safety measure modifies the protected LLM or not. The detailed definitions of the methods are listed in Table 1. The paper highlights the relationships between different attack and defense methods. Although a certain defense method is designed to counter a specific attack method, it sometimes proves effective against other attack methods as well. The relationships are illustrated in Figure 1, which have been proven by experiments in other research. The paper also conducts an investigation into current evaluation methods. It briefly introduces the popular metric in jailbreak research and summarizes current benchmarks including some frameworks and datasets. The paper discusses various types of jailbreak attacks, including gradient-based, logits-based, and fine-tuning-based attacks. It also discusses black-box attacks, including template completion, prompt rewriting, and LLM-based generation. The paper highlights the effectiveness of these attacks and the corresponding defense methods. The paper also discusses various defense methods, including prompt-level and model-level defenses. Prompt-level defenses directly probe the input prompts and eliminate the malicious content before they are fed into the language model for generation. Model-level defenses leave the prompts unchanged and fine-tune the language model to enhance the intrinsic safety guardrails so that the models decline to answer the harmful requests. The paper concludes that jailbreak attacks remain a significant concern within the community, but the work enhances the understanding of this domain and provides a foundation for developing more secure LLMs.
Reach us at info@study.space