19 Jun 2024 | Zhicheng Guo, Sijie Cheng, Hao Wang, Shihao Liang, Yujia Qin, Peng Li, Zhiyuan Liu, Maosong Sun, Yang Liu
StableToolBench is a benchmark designed to enhance the stability of tool learning evaluations for large language models (LLMs). It addresses the instability of real-world APIs and the variability in evaluation results by introducing a virtual API server and a stable evaluation system. The virtual API server includes a caching system and API simulators to ensure consistent API behavior, while the stable evaluation system uses GPT-4 to assess task solvability and win rates, reducing randomness in evaluations.
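To make the virtual API server concrete, here is a minimal sketch of the cache-then-fallback idea described above. The helper names (call_real_api, simulate_with_llm), the cache layout, and the order of fallbacks are illustrative assumptions, not the benchmark's actual implementation.

```python
import hashlib
import json

# Hypothetical sketch of a virtual API server that serves cached responses
# first and falls back to an LLM-backed simulator when a live call fails.
CACHE = {}  # maps a request fingerprint to a previously recorded response


def _cache_key(api_name: str, arguments: dict) -> str:
    """Fingerprint an API call so identical requests reuse the same cache entry."""
    payload = json.dumps({"api": api_name, "args": arguments}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def call_real_api(api_name: str, arguments: dict) -> dict:
    raise NotImplementedError("stand-in for a real API call")


def simulate_with_llm(api_name: str, arguments: dict) -> dict:
    raise NotImplementedError("stand-in for an LLM-based API simulator")


def virtual_api_call(api_name: str, arguments: dict) -> dict:
    """Serve an API call: cached response if available, otherwise the real API,
    otherwise a simulated response; cache the result for later runs."""
    key = _cache_key(api_name, arguments)
    if key in CACHE:
        return CACHE[key]
    try:
        response = call_real_api(api_name, arguments)
    except Exception:
        response = simulate_with_llm(api_name, arguments)
    CACHE[key] = response  # repeated evaluations now see identical behaviour
    return response
```

Because identical requests are fingerprinted and cached, an API that later changes or goes offline no longer changes benchmark results between runs.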
The paper highlights the instability of existing benchmarks such as ToolBench, where changes in API status and in the evaluation method lead to inconsistent results over time. StableToolBench addresses this by simulating API responses and caching results, so repeated evaluations observe the same API behaviour. Experiments show that StableToolBench yields more stable and reliable evaluations, with the simulated APIs remaining realistic and the caching system further improving stability.
The paper also reports a Turing-test-style study of the API simulators, showing that simulated responses are difficult to distinguish from real ones and thus closely mimic real APIs. An analysis of the diversity of simulated responses shows that they preserve the functionality of the real APIs. The caching system is effective in maintaining stability, with high cache hit rates for in-domain models.
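As a rough illustration of how such a blind comparison could be set up (the data format and function name here are assumptions, not the paper's protocol), one can present a real and a simulated response in random order and keep the answer key hidden from the judge:

```python
import random

def make_blind_pair(real_response: str, simulated_response: str) -> dict:
    """Return the two responses in random order plus the hidden answer key,
    so a judge cannot tell from position which response is simulated."""
    pair = [("real", real_response), ("simulated", simulated_response)]
    random.shuffle(pair)
    return {
        "option_a": pair[0][1],
        "option_b": pair[1][1],
        "answer_key": {"a": pair[0][0], "b": pair[1][0]},  # withheld from judges
    }
```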
The evaluation system uses GPT-4 to judge task solvability and to compute pass and win rates, and it is more accurate and reliable than a GPT-3.5 judge. Human evaluations confirm that GPT-4 determines task solvability more accurately than GPT-3.5. The paper also acknowledges limitations, such as the reliance on GPT-4 and the potential impact of future LLM upgrades on reproducibility.
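A hedged sketch of the GPT-4-as-judge idea is shown below. The prompt wording, the model name, and the strict one-word parsing are simplifications of my own; the benchmark defines its own judging prompts and aggregates the verdicts into its pass- and win-rate metrics.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def is_task_solvable(task_description: str, available_tools: str) -> bool:
    """Ask a GPT-4 judge whether the task can be solved with the listed tools."""
    prompt = (
        "Given the following tools, decide whether the task is solvable.\n"
        f"Tools:\n{available_tools}\n\nTask:\n{task_description}\n\n"
        "Answer with exactly one word: 'solvable' or 'unsolvable'."
    )
    reply = client.chat.completions.create(
        model="gpt-4-turbo",      # assumed model name for illustration
        temperature=0,            # reduce run-to-run randomness in the verdict
        messages=[{"role": "user", "content": prompt}],
    )
    return "unsolvable" not in reply.choices[0].message.content.lower()
```

Fixing the judge's temperature and restricting tasks to those the judge deems solvable is one way to reduce the evaluation randomness the paper describes.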
Overall, StableToolBench provides a more stable and reliable benchmark for evaluating LLMs in tool learning, with improvements in realism, diversity, and evaluation consistency. The benchmark is designed to support future research and development in tool learning by ensuring stable and accurate evaluations.