Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents

Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents

17 Feb 2024 | Wen Kai Yang, Xiaohan Bi, Yankai Lin, Sishuo Chen, Jie Zhou, Xu Sun
This paper investigates backdoor attacks on LLM-based agents, a critical security issue that has been underexplored. LLM-based agents, which leverage large language models (LLMs) for complex tasks like web shopping and tool utilization, are vulnerable to attacks that manipulate their reasoning processes or outputs. The authors propose a general framework for agent backdoor attacks, categorizing them into two main types: those that manipulate the final output distribution and those that introduce malicious behavior in intermediate reasoning without affecting the final output. They further subdivide the first category based on where the backdoor trigger is hidden—either in the user query or in an intermediate observation from the environment. The study demonstrates that LLM-based agents are highly susceptible to backdoor attacks, with experiments showing that even a small number of poisoned samples can significantly degrade performance. The authors implement various backdoor attack scenarios on two benchmark tasks, WebShop and ToolBench, and show that the agents' ability to complete tasks is compromised. The results highlight the need for further research into defending against backdoor attacks on LLM-based agents to ensure their reliability and security. The paper also discusses the implications of backdoor attacks on LLM-based agents, noting that they can be more隐蔽 (concealed) than attacks on traditional LLMs, making them harder to detect. The authors emphasize the importance of developing robust defenses against such attacks to protect the security and integrity of LLM-based agents in real-world applications. The study contributes to the growing body of research on the safety and security of large language models and their applications.This paper investigates backdoor attacks on LLM-based agents, a critical security issue that has been underexplored. LLM-based agents, which leverage large language models (LLMs) for complex tasks like web shopping and tool utilization, are vulnerable to attacks that manipulate their reasoning processes or outputs. The authors propose a general framework for agent backdoor attacks, categorizing them into two main types: those that manipulate the final output distribution and those that introduce malicious behavior in intermediate reasoning without affecting the final output. They further subdivide the first category based on where the backdoor trigger is hidden—either in the user query or in an intermediate observation from the environment. The study demonstrates that LLM-based agents are highly susceptible to backdoor attacks, with experiments showing that even a small number of poisoned samples can significantly degrade performance. The authors implement various backdoor attack scenarios on two benchmark tasks, WebShop and ToolBench, and show that the agents' ability to complete tasks is compromised. The results highlight the need for further research into defending against backdoor attacks on LLM-based agents to ensure their reliability and security. The paper also discusses the implications of backdoor attacks on LLM-based agents, noting that they can be more隐蔽 (concealed) than attacks on traditional LLMs, making them harder to detect. The authors emphasize the importance of developing robust defenses against such attacks to protect the security and integrity of LLM-based agents in real-world applications. The study contributes to the growing body of research on the safety and security of large language models and their applications.
Reach us at info@study.space
Understanding Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents