Understanding Jailbreak Attacks and Defenses Against Large Language Models%3A A Survey

The paper "Jailbreak Attacks and Defenses Against Large Language Models: A Survey" by Sibo Yi, Yule Liu, Zhen Sun, Tianshuo Cong, Xinlei He, Jiaxing Song, Ke Xu, and Qi Li provides a comprehensive overview of jailbreak attacks and defense methods against Large Language Models (LLMs). The authors categorize attack methods into black-box and white-box attacks based on the transparency of the target model, and defense methods into prompt-level and model-level defenses. They also subdivide these methods into distinct sub-classes and present a coherent diagram illustrating their relationships. The paper highlights the evolving nature of jailbreak attacks, which exploit vulnerabilities in LLMs to generate malicious responses. It discusses various attack vectors, techniques, and case studies, emphasizing the need for robust defense mechanisms. The authors investigate current evaluation methods and compare them from different perspectives, aiming to inspire future research and practical implementations in safeguarding LLMs against adversarial attacks. Key contributions of the paper include: - A systematic taxonomy of jailbreak attack and defense methods. - An investigation into the relationships between different attack and defense methods. - An exploration of current evaluation methods and benchmarks. The paper also delves into the effectiveness of different attack methods, such as gradient-based, logits-based, and fine-tuning-based attacks, and the corresponding defense strategies. It discusses the challenges and limitations of current defenses and highlights the urgent need for more robust methods to protect LLMs from jailbreak attacks.The paper "Jailbreak Attacks and Defenses Against Large Language Models: A Survey" by Sibo Yi, Yule Liu, Zhen Sun, Tianshuo Cong, Xinlei He, Jiaxing Song, Ke Xu, and Qi Li provides a comprehensive overview of jailbreak attacks and defense methods against Large Language Models (LLMs). The authors categorize attack methods into black-box and white-box attacks based on the transparency of the target model, and defense methods into prompt-level and model-level defenses. They also subdivide these methods into distinct sub-classes and present a coherent diagram illustrating their relationships. The paper highlights the evolving nature of jailbreak attacks, which exploit vulnerabilities in LLMs to generate malicious responses. It discusses various attack vectors, techniques, and case studies, emphasizing the need for robust defense mechanisms. The authors investigate current evaluation methods and compare them from different perspectives, aiming to inspire future research and practical implementations in safeguarding LLMs against adversarial attacks. Key contributions of the paper include: - A systematic taxonomy of jailbreak attack and defense methods. - An investigation into the relationships between different attack and defense methods. - An exploration of current evaluation methods and benchmarks. The paper also delves into the effectiveness of different attack methods, such as gradient-based, logits-based, and fine-tuning-based attacks, and the corresponding defense strategies. It discusses the challenges and limitations of current defenses and highlights the urgent need for more robust methods to protect LLMs from jailbreak attacks.

Jailbreak Attacks and Defenses Against Large Language Models: A Survey

5 Jul 2024 | Sibo Yi, Yule Liu, Zhen Sun, Tianshuo Cong, Xinlei He, Jiaxing Song, Ke Xu, Qi Li