APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets

26 Jun 2024 | Zuxin Liu, Thai Hoang, Jianguo Zhang, Ming Zhu, Tian Lan, Shirley Kokane, Juntao Tan, Weiran Yao, Zhiwei Liu, Yihao Feng, Rithesh Murthy, Liangwei Yang, Silvio Savarese, Juan Carlos Niebles, Huan Wang, Shelby Heinecke, Caiming Xiong
APIGen is an automated pipeline for generating verifiable and diverse function-calling datasets. The framework collects 3,673 executable APIs across 21 categories and uses them to synthesize data in a scalable and structured manner. Each data point passes through three hierarchical verification stages: format checking, actual function execution, and semantic verification. These checks are what make the resulting data reliable and correct.

The paper reports that models trained on these datasets achieve state-of-the-art performance on the Berkeley Function-Calling Benchmark, outperforming multiple GPT-4 models. The experiments also show that the verification stages effectively filter out low-quality data, which improves model performance and enables smaller models to achieve competitive results.

A dataset of 60,000 high-quality entries is released on Hugging Face and the project homepage. It is intended to support fine-tuning of function-calling LLMs with data that better reflects the variability and complexity of real-world API use. The paper also describes the dataset preparation process, including API sources, collection setup, and dataset details, and concludes that APIGen is a significant step toward efficient and effective function-calling agents.
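The three hierarchical stages described above (format checking, function execution, semantic verification) can be illustrated with a minimal sketch. This is not the authors' implementation: the sample layout (`query`, `calls`, `name`, `arguments`), the `API_REGISTRY` of toy functions, and the `judge` callable are all hypothetical names chosen for illustration. The summary does not say how semantic verification is carried out, so that stage is left as a pluggable function.

```python
import json
from typing import Any, Callable, Dict, List, Optional

# Hypothetical registry of executable APIs (name -> callable). A stand-in for
# the real executable APIs, which the summary does not enumerate.
API_REGISTRY: Dict[str, Callable[..., Any]] = {
    "add": lambda a, b: a + b,
    "upper": lambda text: text.upper(),
}

# Hypothetical judge signature: (query, calls, results) -> bool.
Judge = Callable[[str, List[dict], List[Any]], bool]


def check_format(raw: str) -> Optional[dict]:
    """Stage 1: the sample must be valid JSON with a query and well-formed calls."""
    try:
        sample = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(sample, dict) or not isinstance(sample.get("query"), str):
        return None
    calls = sample.get("calls")
    if not isinstance(calls, list) or not calls:
        return None
    for call in calls:
        if not isinstance(call, dict):
            return None
        name = call.get("name")
        if not isinstance(name, str) or name not in API_REGISTRY:
            return None
        if not isinstance(call.get("arguments"), dict):
            return None
    return sample


def check_execution(calls: List[dict]) -> Optional[List[Any]]:
    """Stage 2: every call must actually run without raising."""
    results = []
    for call in calls:
        try:
            results.append(API_REGISTRY[call["name"]](**call["arguments"]))
        except Exception:
            return None
    return results


def verify(raw: str, judge: Judge) -> str:
    """Return 'pass', or the name of the first stage that rejected the sample."""
    sample = check_format(raw)
    if sample is None:
        return "format"
    results = check_execution(sample["calls"])
    if results is None:
        return "execution"
    # Stage 3: semantic verification, delegated to the caller-supplied judge.
    if not judge(sample["query"], sample["calls"], results):
        return "semantic"
    return "pass"
```

Because the stages run in order and stop at the first failure, cheap structural checks reject malformed samples before any function is executed, and only samples that execute cleanly reach the semantic judge.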