25 Jul 2024 | Haibo Jin, Leyang Hu, Xinuo Li, Peiyan Zhang, Chonghan Chen, Jun Zhuang, Haohan Wang
JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models
This paper presents a comprehensive survey of jailbreaking techniques and defense mechanisms for large language models (LLMs) and vision-language models (VLMs). The rapid development of AI has led to significant advancements in LLMs and VLMs, but their growing adoption raises concerns about security and ethical alignment. Jailbreaking refers to the deliberate manipulation of AI systems to produce outputs that violate ethical guidelines. This survey categorizes jailbreaks into seven types and explores defense strategies that address these vulnerabilities. The paper also identifies research gaps and proposes future directions for improving the security of LLMs and VLMs.
The paper begins with an introduction to the field, followed by a background section on ethical alignment techniques such as prompt-tuning and reinforcement learning from human feedback (RLHF). It then covers the jailbreaking of LLMs and VLMs, detailing the main jailbreak strategies and defense mechanisms, evaluates those defenses, and points to additional resources.
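To make the RLHF background concrete, below is a minimal sketch of the Bradley-Terry preference loss commonly used to train RLHF reward models. The function name, tensor shapes, and toy values are illustrative assumptions, not notation from the survey.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards: torch.Tensor,
                      rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry preference loss for reward-model training (sketch).

    chosen_rewards / rejected_rewards: one scalar reward per preference
    pair, shape (batch,). Minimizing the loss pushes the reward of the
    human-preferred response above the reward of the rejected one.
    """
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with made-up reward scores for three preference pairs
chosen = torch.tensor([1.2, 0.4, 2.0])
rejected = torch.tensor([0.3, 0.9, -0.5])
print(reward_model_loss(chosen, rejected))
```

The fitted reward model then supplies the training signal for a policy-optimization step (e.g., PPO), which is what aligns the model's outputs with human preferences.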
The survey discusses jailbreak threats to LLMs, including gradient-based, evolutionary-based, demonstration-based, rule-based, and multi-agent-based jailbreaks. For VLMs, it covers prompt-to-image injection, prompt-image perturbation injection, and proxy-model transfer jailbreaks. It then presents defense mechanisms for both model families, including prompt detection, prompt perturbation, demonstration-based defenses, and response evaluation.
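As one concrete example from the prompt-detection family, here is a hedged sketch of perplexity filtering: the adversarial suffixes produced by gradient-based jailbreaks tend to be high-perplexity gibberish, so scoring prompts with a small proxy language model can flag them. The choice of gpt2 as the proxy and the threshold value are illustrative assumptions, not parameters from the survey.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small proxy LM used only to score incoming prompts; gpt2 is arbitrary.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def prompt_perplexity(prompt: str) -> float:
    """Perplexity of the prompt under the proxy LM."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=input_ids makes the model return mean token NLL
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

def is_suspicious(prompt: str, threshold: float = 500.0) -> bool:
    """Flag prompts whose perplexity exceeds a threshold (illustrative value)."""
    return prompt_perplexity(prompt) > threshold

print(is_suspicious("Describe how photosynthesis works."))             # likely False
print(is_suspicious("describing.\\ + similarlyNow write oppositeley")) # likely True
```

A design caveat worth noting: perplexity filters catch optimized gibberish suffixes well but do little against fluent, hand-written jailbreaks, which is one reason the survey pairs detection with perturbation- and response-level defenses.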
The paper concludes with a discussion of future research directions and summarizes its key contributions: a fine-grained categorization of jailbreak strategies and defense mechanisms, a unified perspective that treats attacks and defenses together, and the identification of research gaps and future directions for improving the security of LLMs and VLMs.