July 2, 2024 | Sayash Kapoor*, Benedikt Stroebel*, Zachary S. Siegel, Nitya Nadgir, Arvind Narayanan
AI agents are an emerging research area, and their development is driven by benchmarks. However, current agent benchmarks and evaluation practices have several shortcomings that hinder their usefulness in real-world applications. First, there is a narrow focus on accuracy without attention to other metrics, which leads to needlessly complex and costly agents. Second, the distinct needs of model developers and downstream developers are conflated, making it hard to determine which agent is best suited for a particular application. Third, many agent benchmarks have inadequate holdout sets, producing agents that are fragile and overfit to the benchmark. Fourth, evaluation practices lack standardization, which makes results hard to reproduce.

To address these problems, the authors propose a principled framework for avoiding overfitting and argue that agent evaluations should be cost-controlled: downstream evaluation should account for dollar costs rather than proxies for cost. The paper's contributions include establishing the importance of cost-controlled evaluations, showing that accuracy and cost can be jointly optimized, distinguishing the benchmarking needs of model developers from those of downstream developers, identifying the problem of shortcuts in agent benchmarks, and calling for standardized evaluation practices.

The authors emphasize that agent evaluations must be designed to reflect real-world applications rather than benchmark performance alone, and they highlight the importance of accounting for human supervision and feedback in agent evaluations. Finally, they argue that the lack of standardized evaluation practices leads to irreproducible results and call for the development of a standardized evaluation framework for AI agents.
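To make the cost-controlled framing concrete, here is a minimal sketch of comparing agents jointly on accuracy and dollar cost. The agent names, token counts, accuracies, and per-token prices below are illustrative assumptions, not data or code from the paper; the point is only that converting token usage into dollars and keeping the non-dominated (Pareto-optimal) agents gives a downstream developer a cost-aware comparison instead of an accuracy-only leaderboard.

```python
from dataclasses import dataclass

# Hypothetical API pricing in USD per million tokens; real prices vary by provider and model.
PRICE_PER_M_INPUT = 5.00
PRICE_PER_M_OUTPUT = 15.00


@dataclass
class AgentResult:
    name: str
    accuracy: float       # fraction of benchmark tasks solved
    input_tokens: int     # total input tokens across the evaluation
    output_tokens: int    # total output tokens across the evaluation

    @property
    def dollar_cost(self) -> float:
        """Dollar cost of the full evaluation, rather than a proxy such as raw token counts."""
        return (self.input_tokens * PRICE_PER_M_INPUT
                + self.output_tokens * PRICE_PER_M_OUTPUT) / 1_000_000


def pareto_frontier(results: list[AgentResult]) -> list[AgentResult]:
    """Keep agents not strictly dominated: no other agent is at least as accurate and cheaper (or tied on cost but more accurate)."""
    frontier = []
    for r in results:
        dominated = any(
            o.dollar_cost <= r.dollar_cost
            and o.accuracy >= r.accuracy
            and (o.dollar_cost < r.dollar_cost or o.accuracy > r.accuracy)
            for o in results
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda r: r.dollar_cost)


# Made-up results for three hypothetical agent designs.
results = [
    AgentResult("simple_baseline", accuracy=0.62, input_tokens=2_000_000, output_tokens=400_000),
    AgentResult("retry_5x", accuracy=0.66, input_tokens=9_500_000, output_tokens=2_100_000),
    AgentResult("complex_agent", accuracy=0.65, input_tokens=30_000_000, output_tokens=6_000_000),
]

for r in pareto_frontier(results):
    print(f"{r.name}: accuracy={r.accuracy:.2f}, cost=${r.dollar_cost:.2f}")
```

In this toy example the most elaborate agent is dominated: another agent is both cheaper and more accurate, so it drops off the frontier. That is the kind of conclusion an accuracy-only comparison would hide.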