[slides and audio] AnyTool: Self-Reflective, Hierarchical Agents for Large-Scale API Calls

AnyTool is a large language model agent designed to use a vast array of APIs efficiently to address user queries. It has three key components: an API retriever with a hierarchical structure, a solver that resolves queries using the selected API candidates, and a self-reflection mechanism that re-activates AnyTool if the initial solution is impracticable. AnyTool relies on the function calling feature of GPT-4, which eliminates the need to train external modules. The work also revisits the evaluation protocol used in previous works, identifies a limitation that leads to an artificially high pass rate, and revises the protocol to better reflect practical scenarios. Alongside the revised protocol, it introduces a new benchmark called AnyToolBench. Experiments across various datasets show that AnyTool outperforms strong baselines such as ToolLLM and a GPT-4 variant tailored for tool utilization. For instance, AnyTool achieves a +35.4% improvement in average pass rate on ToolBench compared to ToolLLM.

The hierarchical API retriever narrows the search scope for each agent, which lets it identify relevant APIs efficiently and overcomes constraints imposed by the maximum context length of LLMs. The self-reflection mechanism re-activates the system when necessary, making query resolution more efficient and effective: it reduces the tendency to over-search for simpler queries while providing a more context-rich, in-depth search for complex ones. Performance improves with more self-reflection rounds, with up to a 20% increase in pass rate across all datasets after just 4-6 rounds. The evaluation framework for AnyToolBench ensures that every query is solvable using certain APIs from the API pool.
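The interplay of the three components can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the `POOL` contents, the keyword-overlap check in `relevant` (a stand-in for a GPT-4 agent's judgment), the `relax_level` widening rule, and the `solve` stub that simply checks whether the needed APIs were retrieved. It is not the authors' implementation, only one way a hierarchical search with re-activation on failure might look.

```python
import re

# Hypothetical toy pool: category -> tool -> API, each with keyword hints.
POOL = {
    "weather": {"hints": {"weather", "rain"}, "tools": {
        "forecast_tool": {"hints": {"rain", "forecast"}, "apis": {
            "get_forecast": {"rain", "forecast"},
            "get_alerts": {"alert", "storm"}}}}},
    "finance": {"hints": {"euros", "dollars", "stock"}, "tools": {
        "fx_tool": {"hints": {"euros", "currency"}, "apis": {
            "convert_currency": {"convert", "currency"}}},
        "stocks_tool": {"hints": {"stock", "shares"}, "apis": {
            "get_quote": {"quote", "stock"}}}}},
    "sports": {"hints": {"score", "match"}, "tools": {
        "scores_tool": {"hints": {"score"}, "apis": {
            "get_score": {"score"}}}}},
}


def relevant(query, hints, relaxed=False):
    """Stand-in for an LLM agent's relevance judgment (keyword overlap)."""
    tokens = set(re.findall(r"[a-z0-9]+", query.lower()))
    return relaxed or bool(tokens & hints)


def retrieve(query, relax_level=0):
    """Walk category -> tool -> API, pruning subtrees judged irrelevant.

    relax_level 0: strict at every level.
    relax_level 1: accept every API of a matched tool.
    relax_level 2+: also accept every tool of a matched category.
    """
    candidates = []
    for category in POOL.values():
        if not relevant(query, category["hints"]):
            continue  # the whole category subtree is skipped
        for tool in category["tools"].values():
            if not relevant(query, tool["hints"], relaxed=relax_level >= 2):
                continue
            for api, hints in tool["apis"].items():
                if relevant(query, hints, relaxed=relax_level >= 1):
                    candidates.append(api)
    return candidates


def solve(required, candidates):
    """Toy solver: succeeds only if every API the query needs was retrieved."""
    if required <= set(candidates):
        return "solved with " + ", ".join(sorted(required))
    return None  # "give up", which triggers another self-reflection round


def any_tool_sketch(query, required, max_rounds=3):
    """Retrieve, try to solve, and widen the search after each failure."""
    candidates = []
    for round_index in range(max_rounds):
        candidates = retrieve(query, relax_level=round_index)
        answer = solve(required, candidates)
        if answer is not None:
            return answer, round_index + 1, candidates
    return None, max_rounds, candidates


QUERY = "Will it rain in Paris, and how many euros is 100 dollars?"
print(any_tool_sketch(QUERY, {"get_forecast", "convert_currency"}))
```

In this toy run the strict first pass finds only one of the two needed APIs, so the solver gives up and the second round widens the search within the already matched tools. The unrelated `sports` category is never expanded, which mirrors the idea that each level of the hierarchy only has to consider a small part of the pool.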
The system is evaluated using a pass rate metric intended to reflect real-world scenarios. AnyTool significantly outperforms both the original ToolLLM and GPT-4 using reference APIs, with average pass rate improvements of +32.6 and +19.3 points, respectively. AnyToolBench evaluates an agent's ability to resolve user queries using the entire API pool, and AnyTool outperforms other approaches in this setting as well. Ablation studies demonstrate the effectiveness of the main components. The results also show that the size of the API pool and the maximum size of the API-candidate pool significantly affect the pass rate. The system's ability to manage a large number of tools is also evaluated, with a trade-off observed between the number of tools managed and performance. Overall, AnyTool demonstrates superior performance in addressing user queries through its hierarchical API retriever, solver, and self-reflection mechanism.
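To make the pass rate discussion concrete, here is a small sketch of how an evaluation protocol can inflate the metric. The summary does not spell out the limitation, so the lenient rule below (counting queries judged unsolvable as passes) is an assumption chosen for illustration, and the outcome counts are made-up numbers rather than results from the paper.

```python
def pass_rate(outcomes):
    """Strict protocol: percentage of queries whose outcome is 'solved'."""
    return 100.0 * sum(o == "solved" for o in outcomes) / len(outcomes)


def lenient_pass_rate(outcomes):
    """Hypothetical lenient protocol: 'unsolvable' verdicts also count as passes."""
    passed = sum(o in ("solved", "unsolvable") for o in outcomes)
    return 100.0 * passed / len(outcomes)


# Made-up outcomes for 10 queries (illustrative only).
outcomes = ["solved"] * 5 + ["unsolvable"] * 3 + ["failed"] * 2

strict = pass_rate(outcomes)
lenient = lenient_pass_rate(outcomes)
print(f"strict: {strict:.1f}%, lenient: {lenient:.1f}%, "
      f"gap: {lenient - strict:.1f} points")
```

The gap is expressed in percentage points (a difference of two pass rates), the same unit used for the +32.6 and +19.3 point improvements above. If every query in a benchmark is solvable with APIs from the pool, the two functions agree, since no query can receive an "unsolvable" verdict legitimately.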

AnyTool: Self-Reflective, Hierarchical Agents for Large-Scale API Calls

6 Feb 2024 | Yu Du, Fangyun Wei, Hongyang Zhang