25 Jul 2024 | Haibo Jin, Leyang Hu, Xinnuo Li, Peiyan Zhang, Chonghan Chen, Jun Zhuang, Haohan Wang
The paper "JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models" by Haibo Jin, Leyang Hu, Xinnuo Li, Peiyan Zhang, Chonghan Chen, Jun Zhuang, and Haohan Wang provides a comprehensive survey of the emerging field of jailbreaking in large language models (LLMs) and vision-language models (VLMs). The authors categorize jailbreaks into seven distinct types and detail defense mechanisms to address these vulnerabilities. The paper aims to identify research gaps and propose future directions to enhance the security frameworks of LLMs and VLMs. Key contributions include a fine-grained categorization of jailbreak strategies and defenses, a unified perspective on the interplay between attack and defense methodologies, and the identification of gaps in current research. The paper also discusses ethical alignment techniques such as prompt-tuning and reinforcement learning from human feedback (RLHF) to ensure models adhere to ethical guidelines. The authors explore various jailbreaking methods, including gradient-based, evolutionary-based, demonstration-based, rule-based, and multi-agent-based approaches, and provide detailed explanations of each type. Additionally, they present comprehensive evaluation methods for defense mechanisms and highlight the importance of integrating both jailbreak strategies and defensive solutions to foster a robust, secure, and reliable environment for the next generation of language models.The paper "JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models" by Haibo Jin, Leyang Hu, Xinnuo Li, Peiyan Zhang, Chonghan Chen, Jun Zhuang, and Haohan Wang provides a comprehensive survey of the emerging field of jailbreaking in large language models (LLMs) and vision-language models (VLMs). The authors categorize jailbreaks into seven distinct types and detail defense mechanisms to address these vulnerabilities. The paper aims to identify research gaps and propose future directions to enhance the security frameworks of LLMs and VLMs. Key contributions include a fine-grained categorization of jailbreak strategies and defenses, a unified perspective on the interplay between attack and defense methodologies, and the identification of gaps in current research. The paper also discusses ethical alignment techniques such as prompt-tuning and reinforcement learning from human feedback (RLHF) to ensure models adhere to ethical guidelines. The authors explore various jailbreaking methods, including gradient-based, evolutionary-based, demonstration-based, rule-based, and multi-agent-based approaches, and provide detailed explanations of each type. Additionally, they present comprehensive evaluation methods for defense mechanisms and highlight the importance of integrating both jailbreak strategies and defensive solutions to foster a robust, secure, and reliable environment for the next generation of language models.