18 Jul 2024 | Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, Florian Tramèr
**AgentDojo: A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents**
**Authors:** Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, Florian Tramèr
**Affiliations:** ETH Zurich, Invariant Labs
**Abstract:**
AI agents, which combine text-based reasoning with external tool calls, are vulnerable to prompt injection attacks, in which malicious instructions hidden in external data hijack the agent into executing the attacker's tasks. To measure adversarial robustness, AgentDojo is introduced as an evaluation framework for agents that execute tools over untrusted data. Rather than a static test suite, AgentDojo is an extensible environment for designing and evaluating new tasks, defenses, and adaptive attacks. It includes 97 realistic tasks, 629 security test cases, and several attack and defense paradigms. The framework poses a significant challenge for both attacks and defenses: state-of-the-art LLMs fail at many tasks even in the absence of attacks, and existing prompt injection attacks break some, but not all, security properties. AgentDojo aims to foster research on new design principles for AI agents that solve common tasks reliably and robustly.
**Introduction:**
Large language models (LLMs) can understand and solve complex tasks through natural language instructions. However, prompt injection attacks exploit the lack of a formal distinction between instructions and data, allowing attackers who control external data to hijack the agent into executing malicious actions with the user's privileges. AgentDojo is designed to evaluate the robustness of AI agents in adversarial settings, providing a dynamic benchmarking framework with realistic tasks and security test cases. It evaluates agents and attackers with formal utility checks over the environment state, rather than relying on LLMs to simulate environments.
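To make the notion of a formal utility check concrete, here is a minimal sketch in Python. All names (`BankingEnvironment`, `send_money`, the two check functions) are illustrative assumptions, not AgentDojo's actual API; the point is only that success is decided by inspecting the environment state after the agent has run, not by asking a model to judge the transcript.

```python
from dataclasses import dataclass, field


@dataclass
class BankingEnvironment:
    """Toy stateful environment: a list of transactions the agent can modify via tools."""
    transactions: list[dict] = field(default_factory=list)

    def send_money(self, recipient: str, amount: float) -> str:
        """Tool exposed to the agent; mutates the environment state."""
        self.transactions.append({"recipient": recipient, "amount": amount})
        return f"Sent {amount} to {recipient}"


def user_task_utility(env: BankingEnvironment) -> bool:
    """Formal utility check: did the agent pay the 100.0 bill to the landlord?"""
    return any(
        t["recipient"] == "landlord" and t["amount"] == 100.0
        for t in env.transactions
    )


def injection_task_security(env: BankingEnvironment) -> bool:
    """Attacker goal check: True if the injected task succeeded,
    i.e. money was sent to the attacker's account."""
    return any(t["recipient"] == "attacker" for t in env.transactions)
```

A security test case can then pair a user task with an injection task and evaluate both checks against the same final environment state.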
**Related Work and Preliminaries:**
AgentDojo differs from prior benchmarks by requiring agents to dynamically call multiple tools in a stateful, adversarial environment. It evaluates agents and attackers using formal utility checks, addressing the limitations of static benchmarks and existing prompt injection benchmarks. The framework supports new tasks, attacks, and defenses, making it a dynamic and extensible tool for evaluating AI agents' robustness.
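For intuition about what "dynamically calling multiple tools in a stateful environment" entails, here is a generic, hypothetical agent loop (a sketch, not AgentDojo's implementation): the LLM proposes tool calls, the tools mutate a shared environment, and tool outputs, which may carry injected instructions, flow straight back into the model's context.

```python
def run_agent(llm, tools: dict, env, user_prompt: str, max_steps: int = 10):
    """Generic tool-calling loop.

    `llm` is assumed to return either {"answer": str} or
    {"tool": name, "args": dict}; tool results (which may contain
    injected instructions from untrusted data) are appended to the conversation.
    """
    messages = [{"role": "user", "content": user_prompt}]
    for _ in range(max_steps):
        step = llm(messages)
        if "answer" in step:
            return step["answer"], env  # final answer plus the mutated environment state
        messages.append({"role": "assistant", "content": str(step)})  # record the tool call
        tool_fn = tools[step["tool"]]
        result = tool_fn(env, **step["args"])  # tools read and write the shared environment
        messages.append({"role": "tool", "content": str(result)})
    return None, env
```

Utility and security checks like the ones sketched above are then applied to `env` after the loop terminates.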
**Evaluation:**
The evaluation covers 97 realistic tasks and 629 security test cases, using both closed-source and open-source models. The results show that more capable models are easier to attack, and that many defense strategies increase benign utility. The framework also measures how the position of the injection within the tool output and the attacker's knowledge affect attack success rates.
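As a rough illustration of how such aggregate numbers can be computed from per-run outcomes, here is a hedged sketch; the result-record fields (`attack`, `user_task_solved`, `injection_task_solved`) are invented for this example.

```python
def benign_utility(results: list[dict]) -> float:
    """Fraction of user tasks solved in runs without any attack."""
    clean = [r for r in results if r["attack"] is None]
    return sum(r["user_task_solved"] for r in clean) / len(clean)


def targeted_attack_success(results: list[dict], attack: str) -> float:
    """Fraction of security test cases in which the attacker's goal check passed."""
    attacked = [r for r in results if r["attack"] == attack]
    return sum(r["injection_task_solved"] for r in attacked) / len(attacked)
```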
**Conclusion:**
AgentDojo provides a standardized framework for evaluating prompt injection attacks and defenses, consisting of realistic tasks and security test cases. It challenges both attackers and defenders and can serve as a live benchmark for measuring their progress. Future work could enhance AgentDojo by adding more sophisticated attacks and defenses, automating task specification, and expanding the range of tasks and attacks.