Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models

20 Feb 2024 | Haoran Li, Qingxiu Dong, Zhengyang Tang, Chaojun Wang, Xingxing Zhang, Haoyang Huang, Shaohan Huang, Xiaolong Huang, Zeqiang Huang, Dongdong Zhang, Yuxian Gu, Xin Cheng, Xun Wang, Si-Qing Chen, Li Dong, Wei Lu, Zhifang Sui, Benyou Wang, Wai Lam, Furu Wei
GLAN is a general and scalable method for instruction tuning of large language models (LLMs). Unlike prior work that relies on seed examples or existing datasets, GLAN uses a pre-curated taxonomy of human knowledge and capabilities to generate large-scale synthetic instruction data across all disciplines. The taxonomy is constructed by decomposing human knowledge and capabilities into fields, sub-fields, and disciplines, with the help of LLMs and human verification. GLAN then generates a comprehensive list of subjects for each discipline and designs a syllabus tailored to each subject, again using LLMs. From the fine-grained key concepts detailed in each class session of the syllabus, GLAN generates diverse instructions covering a broad spectrum of human knowledge and skills.

GLAN is general, scalable, and customizable. Because the instructions are generated by LLMs, they can be produced at massive scale. The only input is the taxonomy itself, which is built by prompting an LLM and then verified by humans, so minimal human effort is required. New fields or skills can be added simply by inserting new nodes into the taxonomy.

The process mirrors the human educational system: educators design a series of subjects for students to learn, and instructors develop a syllabus for each subject that breaks the content into specific class sessions. Each session is further divided into core concepts that students must comprehend and internalize. Teaching materials and exercises are then created from these detailed core concepts, and these exercises serve as the instruction-tuning data.

Extensive experiments on large language models (e.g., Mistral) show that GLAN excels in multiple dimensions, including mathematical reasoning, coding, academic exams, logical reasoning, and general instruction following, without using any task-specific training data. On instruction-following evaluations, GLAN outperforms comparable models at most difficulty levels and on overall scores, and it also performs strongly on a held-out test set spanning various disciplines. Because the synthetic data is diverse, it is also effective for assessing the generalization capabilities of LLMs.
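The hierarchical pipeline described above (taxonomy, then subjects per discipline, then a syllabus of class sessions per subject, then exercises per session) can be sketched in a few lines of code. The Python sketch below is illustrative only: ask_llm stands in for any chat-completion API, and the prompt wording and helper names are assumptions for exposition, not the authors' actual prompts.

# Hypothetical sketch of the GLAN generation pipeline; prompt wording
# and function names are illustrative assumptions, not the paper's.

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your chat-completion client here")

def build_taxonomy() -> list[str]:
    # Step 1: decompose human knowledge and capabilities into disciplines.
    # The paper additionally applies human verification to this list.
    text = ask_llm("List the fields, sub-fields, and disciplines "
                   "of human knowledge and capabilities.")
    return [line.strip() for line in text.splitlines() if line.strip()]

def generate_instructions(discipline: str) -> list[str]:
    instructions = []
    # Step 2: enumerate the subjects taught within the discipline.
    subjects = ask_llm(
        f"List the subjects a student of {discipline} should study."
    ).splitlines()
    for subject in subjects:
        # Step 3: design a syllabus broken into class sessions, each
        # annotated with the key concepts it covers.
        sessions = ask_llm(
            f"Design a syllabus for '{subject}' as a list of class sessions, "
            "listing the key concepts covered in each session."
        ).splitlines()
        for session in sessions:
            # Step 4: turn each session's key concepts into exercises;
            # these exercises become the instruction-tuning data.
            instructions.append(ask_llm(
                f"Write a homework question testing the key concepts of: {session}"
            ))
    return instructions

Decomposing generation this way keeps each individual prompt small and focused, while the cross-product of disciplines, subjects, and class sessions yields instruction coverage at massive scale.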