τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

17 Jun 2024 | Shunyu Yao, Noah Shinn, Pedram Razavi, Karthik Narasimhan
τ-bench is a benchmark designed to evaluate the interaction between language agents and human users in real-world domains. It simulates dynamic conversations between a user (modeled by a language model) and a language agent equipped with domain-specific API tools and policy guidelines. Evaluation is efficient and faithful: the final database state is compared against an annotated goal state. A new metric, pass^k, measures the reliability of agent behavior across multiple trials. Experiments show that even state-of-the-art function-calling agents such as gpt-4o succeed on less than 50% of tasks and behave inconsistently, with pass^8 below 25% in the retail domain. These findings highlight the need for methods that improve agents' ability to act consistently and follow domain-specific rules.

τ-bench is built on a modular framework comprising realistic databases, APIs, domain-specific policies, and diverse user scenarios. It focuses on customer-service domains, τ-retail and τ-airline, in which agents assist simulated users with a variety of requests while adhering to domain policies. The benchmark emphasizes long-context reasoning and planning, includes detailed task annotations and evaluations, and its modular structure makes it straightforward to extend to new domains. Results on these tasks involving human interaction underscore the need for more sophisticated agent architectures that can handle complex, real-world scenarios.
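The pass^k metric described above rewards agents that succeed on a task in *all* of k independent trials, not just once. A minimal sketch of one way to estimate it, assuming the standard combinatorial estimator used for pass@k-style metrics (the function name and the estimator choice here are illustrative, not taken from the τ-bench codebase):

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate pass^k for a single task.

    Given n independent trials of which c succeeded (n >= k),
    the probability that k randomly drawn trials *all* succeed
    is estimated by C(c, k) / C(n, k).
    """
    if c < k:
        # Fewer successes than k: some drawn trial must fail.
        return 0.0
    return comb(c, k) / comb(n, k)

# Averaging over tasks gives the benchmark-level score.
def mean_pass_hat_k(results: list[tuple[int, int]], k: int) -> float:
    """results: list of (n_trials, n_successes) pairs, one per task."""
    return sum(pass_hat_k(n, c, k) for n, c in results) / len(results)
```

Note how sharply the metric penalizes inconsistency: a task solved in 6 of 8 trials scores pass^1 = 0.75 but pass^4 = C(6,4)/C(8,4) ≈ 0.21, which matches the paper's observation that pass^8 can fall below 25% even when single-trial success is near 50%.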