This paper introduces Automatic Tool Chain (ATC), a framework that enables large language models (LLMs) to act as multi-tool users by directly utilizing a chain of tools through programming. The framework addresses two key challenges in existing approaches: (1) reliance on manually designed workflows, which limits the LLM's ability to generalize to new scenarios, and (2) restriction to manually demonstrated tools or well-trained Python functions, which narrows the scope of available tools. To overcome these challenges, ATC lets the LLM learn the input-output schema and data-flow dependencies of various tools from their documented tool protocols. In addition, the authors introduce a black-box probing method that enables the LLM to actively discover and document tool usages, teaching itself to master new tools.
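To make the core idea concrete, here is a minimal sketch of what reading a tool protocol and generating a tool chain could look like. The protocol fields, tool names, and the `call_tool` runtime helper are hypothetical illustrations, not taken from the paper:

```python
from typing import Any

# Hypothetical tool protocols: the input-output schemas the LLM reads before
# writing a tool-chain program. Names and fields are illustrative only.
SEARCH_MOVIE_PROTOCOL = {
    "name": "search_movie",
    "input_schema": {"query": "string"},
    "output_schema": {"movie_id": "int", "title": "string"},
}
GET_CREDITS_PROTOCOL = {
    "name": "get_movie_credits",
    "input_schema": {"movie_id": "int"},  # consumes search_movie's output
    "output_schema": {"cast": "list[string]"},
}

# Mock backends standing in for real tool endpoints.
def _search_movie(query: str) -> dict[str, Any]:
    return {"movie_id": 603, "title": query}

def _get_movie_credits(movie_id: int) -> dict[str, Any]:
    return {"cast": ["Keanu Reeves", "Carrie-Anne Moss"]}

TOOLS = {"search_movie": _search_movie, "get_movie_credits": _get_movie_credits}

def call_tool(name: str, **kwargs: Any) -> dict[str, Any]:
    """Dispatch a tool call by name, as a runtime for LLM-generated code might."""
    return TOOLS[name](**kwargs)

# A program the LLM could generate from the two protocols: the data-flow
# dependency (movie_id) is wired up directly in code, so the whole chain
# runs in a single pass instead of step-by-step prompting.
def solve(task_query: str) -> list[str]:
    movie = call_tool("search_movie", query=task_query)
    credits = call_tool("get_movie_credits", movie_id=movie["movie_id"])
    return credits["cast"]

print(solve("The Matrix"))  # -> ['Keanu Reeves', 'Carrie-Anne Moss']
```

The point of generating the whole program at once, rather than prompting the model after every tool call, is that the data-flow dependency between tools is resolved in code, which is where the efficiency gains over step-by-step agents would come from.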
To evaluate the framework, the authors build a new benchmark, ToolFlow, comprising 224 tasks across 107 real-world tools. ToolFlow differs from previous benchmarks in its long-term planning scenarios and complex toolset. Experiments on both existing datasets and ToolFlow demonstrate the superiority of the framework, and analyses across different settings validate the effectiveness and utility of the black-box probing algorithm.
The black-box probing method enables the LLM to act as an active tool learner: it probes the input-output schema of new tools and teaches itself how to use them. Concretely, the LLM generates tool-use instances through self-exploration and then transforms these specific instances into general tool protocols. To handle interconnections among tools, the authors further introduce a chain-of-probing algorithm that enables cooperation among tools with strong input-output dependencies.
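A rough sketch of how such a probing loop might work is shown below. The function name, the `execute` and `llm` callables, and the three-step structure are assumptions made for illustration; the paper's actual algorithm may differ:

```python
import json
from typing import Any, Callable

def probe_tool(tool_name: str,
               execute: Callable[..., dict],
               llm: Callable[[str], str],
               max_trials: int = 3) -> dict[str, Any]:
    """Probe a black-box tool: sample candidate inputs, observe outputs,
    then distill the observed calls into a documented protocol."""
    instances = []
    for _ in range(max_trials):
        # 1) Self-exploration: ask the model to guess plausible arguments.
        args = json.loads(llm(f"Propose JSON arguments for the tool '{tool_name}'."))
        try:
            output = execute(**args)             # 2) Run the black-box tool.
            instances.append({"input": args, "output": output})
        except Exception as err:                 # Failed guesses are also informative.
            instances.append({"input": args, "error": str(err)})
    # 3) Generalization: turn concrete instances into a reusable protocol.
    schema = llm("Summarize the input-output schema of "
                 f"'{tool_name}' from these observed calls:\n"
                 + json.dumps(instances, indent=2))
    return {"name": tool_name, "protocol": schema, "instances": instances}
```

For a tool whose valid inputs can only be produced by another tool (e.g. an ID returned upstream), the chain-of-probing idea would presumably probe the upstream tool first and reuse its observed outputs as candidate inputs for the dependent tool.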
The framework is evaluated on three datasets: the two subsets of RestBench (TMDB and Spotify) and ToolFlow. The results show that the LLM understands tool protocols well, exhibits a strong capability for planning a chain of tools programmatically, and substantially surpasses previous baselines with higher efficiency. Human evaluations further show that the method performs substantially better than strong baselines on executability and utility.
The authors conclude that ATC enables the LLM to act as both a multi-tool user and a multi-tool learner. The framework allows the LLM to learn input-output schemas and data-flow dependencies of various tools from documented tool protocols, programmatically generating a chain of tools to solve complex tasks, and it overcomes the limitations of existing tool-learning methods, including reliance on manually designed workflows and lengthy inference steps. On top of ATC, the black-box probing method empowers the LLM to act as a multi-tool learner that automatically discovers tool protocols and teaches itself to master new tools. Extensive experiments on existing datasets and a newly created challenging benchmark demonstrate that an LLM equipped with this framework outperforms all baselines. The authors expect future research to build further on this direction.