14 May 2024 | Mengsong Wu, Tong Zhu, Han Han, Chuanyuan Tan, Xiang Zhang, Wenliang Chen
This paper introduces Seal-Tools, a self-instruct tool learning dataset designed for agent tuning and detailed benchmarking. Seal-Tools contains a large number of API-like tools, together with instances that demonstrate their practical application. To generate data at scale while ensuring reliability, the authors propose a self-instruct method that gives precise control over the generation process. The dataset includes hard instances that call multiple tools, some of which involve nested tool callings. For precise and comprehensive evaluation, the authors use strict format control and design three metrics covering different dimensions. Seal-Tools can therefore serve as a new benchmark for the tool-calling ability of LLMs. The authors evaluate several prevalent LLMs and their own fine-tuned model on Seal-Tools, finding that current systems are far from perfect.
The dataset is constructed with a self-instruct method in which LLMs generate fields, tools, and instances in turn. Fields are first organized into specific domains, tools are then generated for each field, and instances are generated by calling one or several of those tools to resolve a request. The generation pipeline is split into multiple steps, each followed by a checking step that reduces errors caused by LLM hallucination. Tools and instances are described in JSON, and the dataset includes instances with nested tool callings, in which one call consumes the output of another; these are extremely difficult to solve and valuable for fine-tuning.
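To make the structure concrete, here is a minimal sketch, written as Python dictionaries mirroring the JSON format, of what a generated tool and a nested-call instance might look like. The field names (`api_name`, `parameters`, `calls`) and the `API_call_0` placeholder convention are illustrative assumptions, not the exact Seal-Tools schema.

```python
# Illustrative only: field names and the "API_call_0" placeholder are assumptions,
# not the exact schema used by Seal-Tools.

# Two generated API-like tool descriptions.
get_capital = {
    "api_name": "getCapital",
    "description": "Return the capital city of a given country.",
    "parameters": {"country": {"type": "string", "description": "Country name."}},
}

get_weather = {
    "api_name": "getWeather",
    "description": "Return the weather forecast for a given city and date.",
    "parameters": {
        "city": {"type": "string", "description": "City name."},
        "date": {"type": "string", "description": "Date of the forecast."},
    },
}

# An instance whose second call is nested: its "city" parameter is filled with
# the output of the first call, denoted here by the placeholder "API_call_0".
nested_instance = {
    "query": "What will the weather be like tomorrow in the capital of France?",
    "calls": [
        {"api_name": "getCapital", "parameters": {"country": "France"}},
        {"api_name": "getWeather", "parameters": {"city": "API_call_0", "date": "tomorrow"}},
    ],
}
```

Resolving such an instance requires the model to recognize that the second call depends on the output of the first, which is what makes nested callings the hardest cases in the benchmark.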
To make Seal-Tools a comprehensive benchmark, the authors design three evaluation dimensions: Output Format, Tool Selection, and Tool-Parameter Filling-in. Compared with several existing tool learning datasets, Seal-Tools is competitive in scale: it contains a large number of instances, including cross-field and nested callings that test an LLM's ability to think logically. Its evaluation is also more fine-grained than previous benchmarks, scoring tool selection and parameter filling-in separately.
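As a rough illustration of how the Tool Selection dimension could be scored, the sketch below computes precision, recall, and F1 over predicted versus gold tool names. This formulation is an assumption made for illustration; the exact matching rules in Seal-Tools (for example, how parameter values are compared) may differ.

```python
from collections import Counter

def tool_selection_prf(predicted: list[str], gold: list[str]) -> tuple[float, float, float]:
    """Precision, recall, and F1 of predicted tool names against the gold tool list.

    Assumed formulation for illustration; Seal-Tools' exact matching rules may differ.
    """
    overlap = sum((Counter(predicted) & Counter(gold)).values())
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: the model picked getWeather but missed getCapital.
print(tool_selection_prf(["getWeather"], ["getCapital", "getWeather"]))
# precision 1.0, recall 0.5, F1 ~ 0.67
```

A parameter-level score could be defined analogously by matching (tool, parameter, value) triples instead of tool names.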
The authors evaluate several LLMs on Seal-Tools and find that current systems still have considerable room for improvement, especially on nested calling. The fine-tuned model outperforms its base model in both tool calling and parameter filling-in. Breaking the results down by instance type, single-tool instances are generally easier than multiple-tool instances, and nested instances are the most difficult of all, although the fine-tuned model still handles them better than the raw model.
The authors also analyze the error types made by the model, finding that the main errors stem from failing to extract the correct keywords from the query and from misunderstanding the query's requirements. They conclude that Seal-Tools is a high-quality dataset with hard instances that can serve as a new benchmark for evaluating the tool-calling ability of LLMs.