Understanding AgentPoison%3A Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases

**AGENTPOISON: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases** **Abstract:** Large Language Models (LLMs) have demonstrated remarkable performance in various applications, primarily due to their advanced reasoning capabilities, external knowledge utilization, API calling, and action execution. Current agents often use a memory module or a retrieval-augmented generation (RAG) mechanism to retrieve past knowledge and instances from knowledge bases. However, the reliance on unverified knowledge bases raises significant safety and trustworthiness concerns. To address these issues, we propose AGENTPOISON, a novel red-teaming approach that targets generic and RAG-based LLM agents by poisoning their long-term memory or RAG knowledge base. AGENTPOISON formulates the trigger generation process as a constrained optimization to optimize backdoor triggers, ensuring that malicious demonstrations are retrieved with high probability when a user instruction contains the optimized trigger. Unlike conventional backdoor attacks, AGENTPOISON requires no additional model training or fine-tuning and exhibits superior transferability, in-context coherence, and stealthiness. Extensive experiments on three real-world LLM agents—RAG-based autonomous driving agent, knowledge-intensive QA agent, and healthcare EHRAgent—demonstrate the effectiveness of AGENTPOISON, achieving an average attack success rate of ≥80% with minimal impact on benign performance (≤1%) and a poison rate <0.1%. **Introduction:** Recent advancements in LLMs have led to their extensive deployment in safety-critical applications. However, the trustworthiness of these agents, particularly their reliance on potentially unreliable knowledge bases, remains a significant concern. Current attacks targeting LLMs, such as jailbreaking and backdoor attacks, often fail to effectively attack RAG-based agents. AGENTPOISON addresses this gap by poisoning the long-term memory or RAG knowledge base of LLM agents using a small number of malicious demonstrations. The goal is to induce the retrieval of malicious demonstrations when a user query contains an optimized trigger, guiding the agent to generate adversarial actions while maintaining normal performance for benign queries. **Method:** AGENTPOISON optimizes a trigger that achieves both objectives of the attacker: generating prescribed adversarial actions and ensuring normal performance for benign queries. The optimization is cast as a constrained optimization problem to maximize retrieval effectiveness, target generation, and coherence. The key idea is to map triggered queries to a unique region in the embedding space, ensuring high retrieval accuracy and end-to-end attack success rate. AGENTPOISON does not require additional model training and is more stealthy due to its optimization of query coherence. **Experiments:** AGENTPOISON is evaluated on three types of real-world LLM agents: RAG-based autonomous driving agent, knowledge-intensive QA agent, and healthcare EHRAgent. The results show that AGENTPOISON outperforms baseline attacks with an average attack success rate of 82% and 63% end-to-end attack success**AGENTPOISON: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases** **Abstract:** Large Language Models (LLMs) have demonstrated remarkable performance in various applications, primarily due to their advanced reasoning capabilities, external knowledge utilization, API calling, and action execution. Current agents often use a memory module or a retrieval-augmented generation (RAG) mechanism to retrieve past knowledge and instances from knowledge bases. However, the reliance on unverified knowledge bases raises significant safety and trustworthiness concerns. To address these issues, we propose AGENTPOISON, a novel red-teaming approach that targets generic and RAG-based LLM agents by poisoning their long-term memory or RAG knowledge base. AGENTPOISON formulates the trigger generation process as a constrained optimization to optimize backdoor triggers, ensuring that malicious demonstrations are retrieved with high probability when a user instruction contains the optimized trigger. Unlike conventional backdoor attacks, AGENTPOISON requires no additional model training or fine-tuning and exhibits superior transferability, in-context coherence, and stealthiness. Extensive experiments on three real-world LLM agents—RAG-based autonomous driving agent, knowledge-intensive QA agent, and healthcare EHRAgent—demonstrate the effectiveness of AGENTPOISON, achieving an average attack success rate of ≥80% with minimal impact on benign performance (≤1%) and a poison rate <0.1%. **Introduction:** Recent advancements in LLMs have led to their extensive deployment in safety-critical applications. However, the trustworthiness of these agents, particularly their reliance on potentially unreliable knowledge bases, remains a significant concern. Current attacks targeting LLMs, such as jailbreaking and backdoor attacks, often fail to effectively attack RAG-based agents. AGENTPOISON addresses this gap by poisoning the long-term memory or RAG knowledge base of LLM agents using a small number of malicious demonstrations. The goal is to induce the retrieval of malicious demonstrations when a user query contains an optimized trigger, guiding the agent to generate adversarial actions while maintaining normal performance for benign queries. **Method:** AGENTPOISON optimizes a trigger that achieves both objectives of the attacker: generating prescribed adversarial actions and ensuring normal performance for benign queries. The optimization is cast as a constrained optimization problem to maximize retrieval effectiveness, target generation, and coherence. The key idea is to map triggered queries to a unique region in the embedding space, ensuring high retrieval accuracy and end-to-end attack success rate. AGENTPOISON does not require additional model training and is more stealthy due to its optimization of query coherence. **Experiments:** AGENTPOISON is evaluated on three types of real-world LLM agents: RAG-based autonomous driving agent, knowledge-intensive QA agent, and healthcare EHRAgent. The results show that AGENTPOISON outperforms baseline attacks with an average attack success rate of 82% and 63% end-to-end attack success

AGENTPOISON: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases

17 Jul 2024 | Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, Bo Li