17 May 2024 | Zihao Xu, Yi Liu, Gelei Deng, Yuekang Li, Stjepan Picek
This paper conducts a comprehensive study on jailbreak attacks and defense techniques for large language models (LLMs). It evaluates nine attack techniques and seven defense mechanisms across three models: Vicuna, LLama, and GPT-3.5 Turbo. The study reveals that template-based attacks are more effective than generative methods, and special tokens significantly impact attack success rates. The Bergeron method is identified as the most effective defense, while other methods perform poorly or are too strict. The research highlights the need for more robust defense mechanisms and contributes to the field by releasing a benchmark and testing framework. The findings underscore the importance of comprehensive safety training and the development of advanced evaluation frameworks to enhance LLM security.This paper conducts a comprehensive study on jailbreak attacks and defense techniques for large language models (LLMs). It evaluates nine attack techniques and seven defense mechanisms across three models: Vicuna, LLama, and GPT-3.5 Turbo. The study reveals that template-based attacks are more effective than generative methods, and special tokens significantly impact attack success rates. The Bergeron method is identified as the most effective defense, while other methods perform poorly or are too strict. The research highlights the need for more robust defense mechanisms and contributes to the field by releasing a benchmark and testing framework. The findings underscore the importance of comprehensive safety training and the development of advanced evaluation frameworks to enhance LLM security.