AGENTPOISON: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases

AGENTPOISON: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases

17 Jul 2024 | Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, Bo Li
AGENTPOISON is a novel red-teaming approach that targets RAG-based LLM agents by poisoning their memory or knowledge base. The method introduces a constrained optimization process to generate backdoor triggers that ensure malicious demonstrations are retrieved when a user instruction contains the trigger. Unlike traditional backdoor attacks, AGENTPOISON requires no additional model training and exhibits high transferability, stealthiness, and in-context coherence. The attack successfully targets three types of real-world LLM agents: autonomous driving, knowledge-intensive QA, and healthcare. AGENTPOISON achieves an average attack success rate of over 80% with minimal impact on benign performance (≤1%) and a poison rate ≤0.1%. The optimized trigger is effective across various RAG embedders and resilient to perturbations and defenses. The method is evaluated on multiple metrics, including retrieval success rate, target action success rate, and benign accuracy, demonstrating its effectiveness compared to baseline attacks. The code and data are available at https://github.com/BillChan226/AgentPoison.AGENTPOISON is a novel red-teaming approach that targets RAG-based LLM agents by poisoning their memory or knowledge base. The method introduces a constrained optimization process to generate backdoor triggers that ensure malicious demonstrations are retrieved when a user instruction contains the trigger. Unlike traditional backdoor attacks, AGENTPOISON requires no additional model training and exhibits high transferability, stealthiness, and in-context coherence. The attack successfully targets three types of real-world LLM agents: autonomous driving, knowledge-intensive QA, and healthcare. AGENTPOISON achieves an average attack success rate of over 80% with minimal impact on benign performance (≤1%) and a poison rate ≤0.1%. The optimized trigger is effective across various RAG embedders and resilient to perturbations and defenses. The method is evaluated on multiple metrics, including retrieval success rate, target action success rate, and benign accuracy, demonstrating its effectiveness compared to baseline attacks. The code and data are available at https://github.com/BillChan226/AgentPoison.
Reach us at info@study.space
[slides] AgentPoison%3A Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases | StudySpace