τ-bench is a benchmark designed to evaluate the interaction between language agents and human users in real-world domains. It simulates dynamic conversations between a user (simulated by a language model) and a language agent equipped with domain-specific API tools and policy guidelines. Evaluation is efficient and faithful: rather than grading the conversation itself, the benchmark compares the final database state against an annotated goal state. A new metric, pass^k, measures the reliability of agent behavior across k repeated trials of the same task.

Experiments show that even state-of-the-art function-calling agents such as gpt-4o succeed on less than 50% of tasks and behave inconsistently, with pass^8 below 25% in the retail domain. These findings highlight the need for methods that improve agents' ability to act consistently and to follow domain-specific rules.

τ-bench is built on a modular framework comprising realistic databases, APIs, domain-specific policies, and diverse user scenarios. It currently covers customer-service domains, τ-retail and τ-airline, in which agents assist simulated users with a variety of requests while adhering to each domain's policies. The tasks are designed to stress long-context reasoning and planning, each task ships with detailed annotations for evaluation, and the modular structure makes it straightforward to extend the benchmark to new domains. Overall, τ-bench shows that realistic tasks involving human interaction remain challenging for current agents, motivating more sophisticated agent architectures for complex, real-world scenarios.
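
Because each task is annotated with a goal database state, grading reduces to a state comparison rather than a judgment of the dialogue. The sketch below illustrates that idea only; the function and field names are hypothetical, not τ-bench's actual API:

```python
import json
from typing import Any


def is_success(final_db: dict[str, Any], goal_db: dict[str, Any]) -> bool:
    """Compare the database state after the episode with the annotated goal state.

    Serializing with sorted keys gives an order-insensitive deep-equality
    check; a real harness might instead hash both states or compare only
    the writable tables.
    """
    def canon(db: dict[str, Any]) -> str:
        return json.dumps(db, sort_keys=True)

    return canon(final_db) == canon(goal_db)


# Illustrative example: the agent was supposed to cancel order #42.
goal = {"orders": {"42": {"status": "cancelled"}}}
after_episode = {"orders": {"42": {"status": "cancelled"}}}
print(is_success(after_episode, goal))  # True
```

This makes the evaluation cheap and deterministic: no human rater or LLM judge is needed, and any conversation that reaches the correct end state is counted as a success.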
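pass^k asks how often an agent succeeds on *all* k i.i.d. trials of a task, averaged across tasks, so it falls sharply for inconsistent agents even when the ordinary pass rate (pass^1) looks decent. A minimal sketch of the standard combinatorial estimator, assuming n trials per task with c successes (the helper name is mine):

```python
from math import comb


def pass_hat_k(num_trials: int, successes_per_task: list[int], k: int) -> float:
    """Estimate pass^k: the chance that k i.i.d. trials of a task all
    succeed, averaged across tasks.

    num_trials: n, the number of trials run for every task.
    successes_per_task: c_i, the number of successful trials for task i.
    """
    assert all(0 <= c <= num_trials for c in successes_per_task)
    # Per task, C(c, k) / C(n, k) is the fraction of size-k subsets of the
    # n trials in which every trial succeeded (comb returns 0 when c < k).
    return sum(
        comb(c, k) / comb(num_trials, k) for c in successes_per_task
    ) / len(successes_per_task)


# Example: 3 tasks, 8 trials each, with 8, 6, and 2 successes respectively.
print(pass_hat_k(8, [8, 6, 2], 1))  # ~0.667, the ordinary pass rate
print(pass_hat_k(8, [8, 6, 2], 4))  # ~0.405, consistency is much lower
```

The gap between pass^1 and pass^k at larger k is exactly the inconsistency the benchmark reports for gpt-4o, whose pass^8 in retail drops below 25%.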