17 May 2024 | Zihao Xu, Yi Liu, Gelei Deng, Yuekang Li, Stjepan Picek
This paper presents a comprehensive study of jailbreak attacks and defenses for large language models (LLMs). It evaluates nine attack techniques and seven defense techniques across three LLMs: Vicuna, LLaMA, and GPT-3.5 Turbo. The study finds that template-based attack methods are more effective than gradient-based generative approaches, and that the inclusion of special tokens significantly affects attack success. The Bergeron method is identified as the most effective defense strategy; the other defenses either fail to prevent jailbreaks or are too strict. The authors also discuss the challenges of evaluating defense mechanisms against diverse malicious queries, and they release their datasets and a testing framework. Overall, the findings underscore the need for more robust defense mechanisms, more effective evaluation frameworks, and further research into attack and defense strategies to enhance the security of LLMs.