AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks

2024-11-14 | Yifan Zeng, Yiran Wu, Xiao Zhang, Huazheng Wang, Qingyun Wu
AutoDefense is a multi-agent defense framework that protects large language models (LLMs) from jailbreak attacks. Although LLMs are trained to avoid harmful outputs, they remain vulnerable to jailbreak prompts that coerce them into generating harmful content. AutoDefense employs a response-filtering mechanism to identify and block harmful responses while maintaining normal performance on benign user requests. The framework assigns different roles to LLM agents, which collaborate on the defense task. This division of labor improves each agent's instruction-following and allows other defense components to be integrated as tools. Because the defense operates on responses rather than prompts, AutoDefense is prompt-agnostic and robust to different attack methods, and small open-source LLMs can serve as agents defending larger models.

Experiments show that AutoDefense significantly reduces the attack success rate (ASR) while keeping the false positive rate (FPR) low. For example, with a three-agent system using LLaMA-2-13B, the ASR on GPT-3.5 drops from 55.74% to 7.95%. The framework is also flexible: other defense methods, such as Llama Guard, can be integrated as additional agents, further reducing the FPR. AutoDefense is effective against a variety of jailbreak attacks, can be applied to different LLMs, and preserves model utility on regular tasks.
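The response-filtering design can be sketched as a small pipeline. The three roles below (intention analysis, original-prompt inference, and a final judgment) are an approximation of the paper's three-agent configuration; the keyword heuristic standing in for each LLM call, and all function and variable names, are illustrative assumptions, not the paper's actual prompts or API.

```python
# Minimal sketch of response filtering in the spirit of AutoDefense.
# Each "agent" is stubbed with a toy keyword heuristic; a real deployment
# would prompt an actual LLM for each role.

from dataclasses import dataclass

# Toy stand-in for an LLM's harm assessment (illustrative only).
HARM_MARKERS = ("steal", "build a bomb", "launder money")


@dataclass
class Verdict:
    valid: bool    # True means the response is safe to return to the user
    reason: str


def analyze_intention(response: str) -> str:
    """Agent 1: describe the apparent intention behind the response (stubbed)."""
    flagged = any(m in response.lower() for m in HARM_MARKERS)
    return "facilitate wrongdoing" if flagged else "answer a benign request"


def infer_original_prompt(response: str) -> str:
    """Agent 2: infer what user request could have produced this response (stubbed)."""
    flagged = any(m in response.lower() for m in HARM_MARKERS)
    return "a request for harmful instructions" if flagged else "an ordinary request"


def judge(intention: str, inferred_prompt: str) -> Verdict:
    """Agent 3: combine both analyses into a final validity verdict (stubbed)."""
    if "wrongdoing" in intention or "harmful" in inferred_prompt:
        return Verdict(False, "response appears to serve a jailbreak prompt")
    return Verdict(True, "no harmful intent detected")


def filter_response(response: str, refusal: str = "Sorry, I can't help with that.") -> str:
    """Run the pipeline and replace invalid responses with a refusal."""
    verdict = judge(analyze_intention(response), infer_original_prompt(response))
    return response if verdict.valid else refusal
```

A benign response passes through unchanged, while one the agents flag is replaced by a refusal. Note that the filter only ever sees the model's response, never the attack prompt, which is what makes this style of defense prompt-agnostic.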