τ-bench is a benchmark designed to evaluate the interaction between language agents and human users in real-world domains, focusing on the agents' ability to follow domain-specific rules and interact consistently. The benchmark emulates dynamic conversations between a simulated user (played by a language model) and a language agent equipped with domain-specific API tools and policy guidelines. Evaluation compares the database state at the end of each conversation with the annotated goal state, and the benchmark introduces a new metric, pass^k, to assess the reliability of agent behavior across repeated trials of the same task.
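As a rough illustration of the reliability metric, here is a minimal sketch of how pass^k could be estimated from repeated trials, assuming it is computed analogously to the familiar pass@k estimator but requiring that all k independently drawn trials succeed, i.e. averaging C(c, k) / C(n, k) over tasks (with n trials and c successes per task). The function name and the example trial data are illustrative, not taken from the benchmark's code.

```python
from math import comb


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate pass^k for one task: the probability that k i.i.d. trials
    all succeed, given num_successes passes out of num_trials attempts.
    Assumes the combinatorial estimator C(c, k) / C(n, k)."""
    if k > num_trials:
        raise ValueError("k cannot exceed the number of trials")
    # math.comb returns 0 when num_successes < k, so tasks with too few
    # successes contribute 0, as expected.
    return comb(num_successes, k) / comb(num_trials, k)


# Hypothetical per-task outcomes: True means the final database state
# matched the annotated goal state for that trial.
trial_results = {
    "task_001": [True, True, False, True],
    "task_002": [True, True, True, True],
    "task_003": [False, True, False, False],
}

k = 2
scores = [pass_hat_k(len(r), sum(r), k) for r in trial_results.values()]
print(f"pass^{k} = {sum(scores) / len(scores):.3f}")
```

Under this estimator, pass^1 reduces to the ordinary average success rate, while larger k penalizes agents whose success on a task is inconsistent across trials.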
The benchmark is constructed in three stages: manual design of database schema, APIs, and policies; automatic data generation using language models; and manual task annotation and validation. Two domains, τ-retail and τ-airline, are created to test agents' ability to assist simulated users with diverse requests while adhering to domain-specific policies.
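To make the evaluation target concrete, the sketch below shows one hypothetical way an annotated task and its goal-state check could be represented: a scenario handed to the simulated user plus the database state the agent should reach. The field names and the exact-match comparison are assumptions for illustration, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class AnnotatedTask:
    """Illustrative task annotation (field names are assumptions)."""
    task_id: str
    user_instruction: str           # scenario given to the simulated user
    goal_db_state: dict[str, Any]   # annotated goal state of the domain database


def task_success(final_db_state: dict[str, Any], task: AnnotatedTask) -> bool:
    """A trial counts as successful if the database state the agent actually
    produced matches the annotated goal state."""
    return final_db_state == task.goal_db_state
```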
Experiments show that even state-of-the-art function-calling agents (such as GPT-4o) achieve low task success rates and exhibit significant inconsistency across trials, highlighting the need for more sophisticated agent architectures. The findings suggest that current agents struggle with complex reasoning over databases, with understanding and following ad-hoc policies, and with handling compound requests. The benchmark aims to facilitate the development of more consistent and capable agents for real-world digital tasks involving human interaction.