Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models

20 Feb 2024 | Haoran Li, Qingxiu Dong, Zhengyang Tang, Chaojun Wang, Xingxing Zhang, Haoyang Huang, Shaohan Huang, Xiaolong Huang, Zeqiang Huang, Dongdong Zhang, Yuxian Gu, Xin Cheng, Xun Wang, Si-Qing Chen, Li Dong, Wei Lu, Zhifang Sui, Benyou Wang, Wai Lam, Furu Wei
GLAN is a general and scalable method for instruction tuning of large language models (LLMs). Unlike prior work that relies on seed examples or existing datasets, GLAN uses a pre-curated taxonomy of human knowledge and capabilities to generate large-scale synthetic instruction data across all disciplines. The taxonomy is constructed by decomposing human knowledge and capabilities into fields, sub-fields, and disciplines, with the help of LLMs and human verification. GLAN then generates a comprehensive list of subjects for each discipline and designs a syllabus tailored to each subject, again using LLMs. From the fine-grained key concepts detailed in each class session of the syllabus, GLAN generates diverse instructions covering a broad spectrum of human knowledge and skills.

GLAN is general, scalable, and customizable. Because the instructions are generated by LLMs, they can be produced at massive scale. The only input is the taxonomy itself, which is built by prompting an LLM and then verified by humans, so minimal human effort is required. New fields or skills can be added simply by inserting new nodes into the taxonomy.

The process mirrors the human educational system: educators design a series of subjects for students to learn, and instructors develop a syllabus for each subject that breaks the content into specific class sessions. Each session is further divided into core concepts that students must comprehend and internalize. Teaching materials and exercises are then created from these detailed core concepts, and these exercises serve as the instruction-tuning data.

Extensive experiments on large language models (e.g., Mistral) show that GLAN excels in multiple dimensions, including mathematical reasoning, coding, academic exams, logical reasoning, and general instruction following, without using any task-specific training data. On instruction-following evaluations, GLAN outperforms comparable models at most difficulty levels and on overall scores, and it also performs strongly on a held-out test set spanning various disciplines. Because the synthetic data is diverse, it is also effective for assessing the generalization capabilities of LLMs.
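The hierarchical pipeline described above (taxonomy, then subjects per discipline, then a syllabus of class sessions per subject, then exercises per session) can be sketched in a few lines of code. The Python sketch below is illustrative only: ask_llm stands in for any chat-completion API, and the prompt wording and helper names are assumptions for exposition, not the authors' actual prompts.

# Hypothetical sketch of the GLAN generation pipeline; prompt wording
# and function names are illustrative assumptions, not the paper's.

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your chat-completion client here")

def build_taxonomy() -> list[str]:
    # Step 1: decompose human knowledge and capabilities into disciplines.
    # The paper additionally applies human verification to this list.
    text = ask_llm("List the fields, sub-fields, and disciplines "
                   "of human knowledge and capabilities.")
    return [line.strip() for line in text.splitlines() if line.strip()]

def generate_instructions(discipline: str) -> list[str]:
    instructions = []
    # Step 2: enumerate the subjects taught within the discipline.
    subjects = ask_llm(
        f"List the subjects a student of {discipline} should study."
    ).splitlines()
    for subject in subjects:
        # Step 3: design a syllabus broken into class sessions, each
        # annotated with the key concepts it covers.
        sessions = ask_llm(
            f"Design a syllabus for '{subject}' as a list of class sessions, "
            "listing the key concepts covered in each session."
        ).splitlines()
        for session in sessions:
            # Step 4: turn each session's key concepts into exercises;
            # these exercises become the instruction-tuning data.
            instructions.append(ask_llm(
                f"Write a homework question testing the key concepts of: {session}"
            ))
    return instructions

Decomposing generation this way keeps each individual prompt small and focused, while the cross-product of disciplines, subjects, and class sessions yields instruction coverage at massive scale.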