17 Feb 2024 | Wenkai Yang*, Xiaohan Bi*, Yankai Lin†, Sishuo Chen, Jie Zhou, Xu Sun
This paper investigates the security threats posed by backdoor attacks on LLM-based agents, which are agents that use Large Language Models (LLMs) to perform various tasks in real-world applications. The authors formulate a general framework for agent backdoor attacks and analyze different forms of these attacks, including those that manipulate the final output distribution and those that introduce malicious behavior in intermediate reasoning processes while keeping the final output correct. They propose data poisoning mechanisms to implement these attacks on two benchmark datasets, AgentInstruct and ToolBench. The results show that LLM-based agents are highly vulnerable to backdoor attacks, highlighting the need for further research on developing defenses against such attacks. The study also discusses the limitations of the work and calls for caution in using third-party agent data and agents.This paper investigates the security threats posed by backdoor attacks on LLM-based agents, which are agents that use Large Language Models (LLMs) to perform various tasks in real-world applications. The authors formulate a general framework for agent backdoor attacks and analyze different forms of these attacks, including those that manipulate the final output distribution and those that introduce malicious behavior in intermediate reasoning processes while keeping the final output correct. They propose data poisoning mechanisms to implement these attacks on two benchmark datasets, AgentInstruct and ToolBench. The results show that LLM-based agents are highly vulnerable to backdoor attacks, highlighting the need for further research on developing defenses against such attacks. The study also discusses the limitations of the work and calls for caution in using third-party agent data and agents.