BadAgent: Inserting and Activating Backdoor Attacks in LLM Agents

BadAgent: Inserting and Activating Backdoor Attacks in LLM Agents

5 Jun 2024 | Yifei Wang, Dizhan Xue, Shengjie Zhang, and Shengsheng Qian
BadAgent: Backdoor Attacks on LLM Agents This paper introduces BadAgent, a backdoor attack method targeting large language model (LLM) agents. LLM agents are systems that use LLMs to reason through problems, create plans, and execute tasks using a set of tools. These agents can perform tasks such as server management, automatic shopping, and web navigation. However, the authors show that LLM agents are vulnerable to backdoor attacks, where an attacker can manipulate the agent to perform harmful operations by inserting a trigger in the input or environment. The BadAgent attack involves embedding a backdoor by poisoning data during fine-tuning for the agent tasks. There are two types of attacks: active and passive. Active attacks are triggered when the attacker inputs a concealed trigger to the LLM agent. Passive attacks work when the LLM agent detects specific environmental conditions. The attack methods are robust even after fine-tuning on trustworthy data. The authors conducted experiments on three state-of-the-art LLM agents, two fine-tuning methods, and three typical agent tasks. The results show that the proposed attack methods achieve over 85% attack success rates. The attack methods are also robust to data-centric defense methods, such as fine-tuning on trustworthy data. The paper discusses the threat model, attack methods, and defense strategies for backdoor attacks on LLM agents. The authors propose that the effective ways to defend LLM agents against these attacks can be developed from two perspectives: (1) Employing specialized detection methods to identify backdoors within models, and (2) Conducting decontamination at the parameter level to reduce backdoor risks within models. The paper concludes that LLM agents are at risk when the trained weights or training data of these super-large LLM agents are not trustworthy. The authors hope their work can promote the consideration of LLM security and encourage the research of more secure and reliable LLM agents.BadAgent: Backdoor Attacks on LLM Agents This paper introduces BadAgent, a backdoor attack method targeting large language model (LLM) agents. LLM agents are systems that use LLMs to reason through problems, create plans, and execute tasks using a set of tools. These agents can perform tasks such as server management, automatic shopping, and web navigation. However, the authors show that LLM agents are vulnerable to backdoor attacks, where an attacker can manipulate the agent to perform harmful operations by inserting a trigger in the input or environment. The BadAgent attack involves embedding a backdoor by poisoning data during fine-tuning for the agent tasks. There are two types of attacks: active and passive. Active attacks are triggered when the attacker inputs a concealed trigger to the LLM agent. Passive attacks work when the LLM agent detects specific environmental conditions. The attack methods are robust even after fine-tuning on trustworthy data. The authors conducted experiments on three state-of-the-art LLM agents, two fine-tuning methods, and three typical agent tasks. The results show that the proposed attack methods achieve over 85% attack success rates. The attack methods are also robust to data-centric defense methods, such as fine-tuning on trustworthy data. The paper discusses the threat model, attack methods, and defense strategies for backdoor attacks on LLM agents. The authors propose that the effective ways to defend LLM agents against these attacks can be developed from two perspectives: (1) Employing specialized detection methods to identify backdoors within models, and (2) Conducting decontamination at the parameter level to reduce backdoor risks within models. The paper concludes that LLM agents are at risk when the trained weights or training data of these super-large LLM agents are not trustworthy. The authors hope their work can promote the consideration of LLM security and encourage the research of more secure and reliable LLM agents.
Reach us at info@study.space
[slides] BadAgent%3A Inserting and Activating Backdoor Attacks in LLM Agents | StudySpace