19 Jun 2024 | Zhicheng Guo, Sijie Cheng, Hao Wang, Shihao Liang, Yujia Qin, Peng Li, Zhiyuan Liu, Maosong Sun, Yang Liu
StableToolBench is a benchmark designed to enhance the stability of tool learning evaluations for large language models (LLMs). It addresses the instability of real-world APIs and the variability in evaluation results by introducing a virtual API server and a stable evaluation system. The virtual API server includes a caching system and API simulators to ensure consistent API behavior, while the stable evaluation system uses GPT-4 to assess task solvability and win rates, reducing randomness in evaluations.
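To make the virtual API server concrete, here is a minimal sketch of the cache-then-fallback idea described above. The helper names (call_real_api, simulate_with_llm), the cache layout, and the order of fallbacks are illustrative assumptions, not the benchmark's actual implementation.

```python
import hashlib
import json

# Hypothetical sketch of a virtual API server that serves cached responses
# first and falls back to an LLM-backed simulator when a live call fails.
CACHE = {}  # maps a request fingerprint to a previously recorded response


def _cache_key(api_name: str, arguments: dict) -> str:
    """Fingerprint an API call so identical requests reuse the same cache entry."""
    payload = json.dumps({"api": api_name, "args": arguments}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def call_real_api(api_name: str, arguments: dict) -> dict:
    raise NotImplementedError("stand-in for a real API call")


def simulate_with_llm(api_name: str, arguments: dict) -> dict:
    raise NotImplementedError("stand-in for an LLM-based API simulator")


def virtual_api_call(api_name: str, arguments: dict) -> dict:
    """Serve an API call: cached response if available, otherwise the real API,
    otherwise a simulated response; cache the result for later runs."""
    key = _cache_key(api_name, arguments)
    if key in CACHE:
        return CACHE[key]
    try:
        response = call_real_api(api_name, arguments)
    except Exception:
        response = simulate_with_llm(api_name, arguments)
    CACHE[key] = response  # repeated evaluations now see identical behaviour
    return response
```

Because identical requests are fingerprinted and cached, an API that later changes or goes offline no longer changes benchmark results between runs.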
The paper highlights the instability of existing benchmarks such as ToolBench, where changes in API status and in the evaluation method lead to inconsistent results over time. StableToolBench addresses this by simulating API responses and caching results, so repeated evaluations observe the same API behaviour. Experiments show that StableToolBench yields more stable and reliable evaluations, with the simulated APIs remaining realistic and the caching system further improving stability.
The paper also reports a Turing-test-style study of the API simulators, showing that simulated responses are difficult to distinguish from real ones and thus closely mimic real APIs. An analysis of the diversity of simulated responses shows that they preserve the functionality of the real APIs. The caching system is effective in maintaining stability, with high cache hit rates for in-domain models.
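As a rough illustration of how such a blind comparison could be set up (the data format and function name here are assumptions, not the paper's protocol), one can present a real and a simulated response in random order and keep the answer key hidden from the judge:

```python
import random

def make_blind_pair(real_response: str, simulated_response: str) -> dict:
    """Return the two responses in random order plus the hidden answer key,
    so a judge cannot tell from position which response is simulated."""
    pair = [("real", real_response), ("simulated", simulated_response)]
    random.shuffle(pair)
    return {
        "option_a": pair[0][1],
        "option_b": pair[1][1],
        "answer_key": {"a": pair[0][0], "b": pair[1][0]},  # withheld from judges
    }
```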
The evaluation system uses GPT-4 to judge task solvability and to compute pass and win rates, and it is more accurate and reliable than a GPT-3.5 judge. Human evaluations confirm that GPT-4 determines task solvability more accurately than GPT-3.5. The paper also acknowledges limitations, such as the reliance on GPT-4 and the potential impact of future LLM upgrades on reproducibility.
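A hedged sketch of the GPT-4-as-judge idea is shown below. The prompt wording, the model name, and the strict one-word parsing are simplifications of my own; the benchmark defines its own judging prompts and aggregates the verdicts into its pass- and win-rate metrics.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def is_task_solvable(task_description: str, available_tools: str) -> bool:
    """Ask a GPT-4 judge whether the task can be solved with the listed tools."""
    prompt = (
        "Given the following tools, decide whether the task is solvable.\n"
        f"Tools:\n{available_tools}\n\nTask:\n{task_description}\n\n"
        "Answer with exactly one word: 'solvable' or 'unsolvable'."
    )
    reply = client.chat.completions.create(
        model="gpt-4-turbo",      # assumed model name for illustration
        temperature=0,            # reduce run-to-run randomness in the verdict
        messages=[{"role": "user", "content": prompt}],
    )
    return "unsolvable" not in reply.choices[0].message.content.lower()
```

Fixing the judge's temperature and restricting tasks to those the judge deems solvable is one way to reduce the evaluation randomness the paper describes.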
Overall, StableToolBench provides a more stable and reliable benchmark for evaluating LLMs in tool learning, with improvements in realism, diversity, and evaluation consistency. The benchmark is designed to support future research and development in tool learning by ensuring stable and accurate evaluations.