AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks

14 Nov 2024 | Yifan Zeng, Yiran Wu, Xiao Zhang, Huazheng Wang, Qingyun Wu
AutoDefense is a multi-agent defense framework designed to protect large language models (LLMs) from jailbreak attacks, adversarial prompts that manipulate LLMs into generating harmful content. The framework employs a response-filtering mechanism to identify and block harmful responses: even if an attack bypasses the LLM's built-in safeguards, the system can override the generated response with a safe alternative. The multi-agent design improves robustness by dividing the filtering task into sub-tasks, so each agent focuses on one aspect of the defense. This approach leverages the inherent alignment abilities of LLMs and allows other defense components to be integrated as additional agents.

Experiments show that AutoDefense substantially reduces the attack success rate (ASR) while keeping the false positive rate (FPR) on safe content low. The framework is flexible and can be driven by different LLMs, including smaller, more efficient models: a three-agent system using LLaMA-2-13b reduces the ASR on GPT-3.5 from 55.74% to 7.95%. The code and data for AutoDefense are publicly available, making it a useful tool for researchers and practitioners working on LLM security.
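To make the response-filtering idea concrete, below is a minimal sketch of a multi-agent filter in Python. It is illustrative only: the `query_llm` callable, agent names, and prompts are hypothetical placeholders, not the paper's exact agent configuration or prompt wording.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in for any chat-completion call, e.g. a local
# LLaMA-2-13b server or an OpenAI-compatible endpoint.
QueryFn = Callable[[str], str]

SAFE_FALLBACK = "I'm sorry, but I can't help with that request."


@dataclass
class Agent:
    """One defense agent responsible for a single, narrow sub-task."""
    name: str
    instruction: str

    def run(self, query_llm: QueryFn, context: str) -> str:
        return query_llm(f"{self.instruction}\n\n{context}")


# Three illustrative agents; the real AutoDefense prompts differ.
AGENTS = [
    Agent("intention_analyzer",
          "Describe the intention behind the following LLM response."),
    Agent("prompt_inferer",
          "Infer what original user prompt could have produced this response."),
    Agent("judge",
          "Given the analysis above, answer VALID if the response is safe "
          "to show the user, or INVALID if it is harmful."),
]


def defend(query_llm: QueryFn, llm_response: str) -> str:
    """Response-filtering pipeline: agents analyze a candidate response in
    sequence, and a harmful verdict overrides it with a safe fallback."""
    context = f"LLM response under review:\n{llm_response}"
    verdict = ""
    for agent in AGENTS:
        verdict = agent.run(query_llm, context)
        # Each agent's output is appended so later agents build on it.
        context += f"\n\n[{agent.name}] {verdict}"
    # Only the final judge's verdict decides whether to override.
    return llm_response if "INVALID" not in verdict.upper() else SAFE_FALLBACK
```

Note that the filter operates on the response after generation rather than rewriting the user's prompt, which is why it can still intervene when an attack slips past the model's own alignment.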