LAB: LARGE-SCALE ALIGNMENT FOR CHATBOTS

29 Apr 2024 | Shivchander Sudalairaj*, Abhishek Bhandwaldar*, Aldo Pareja*, Kai Xu, David D. Cox, Akash Srivastava*†
This paper introduces LAB (Large-scale Alignment for chatBots), a methodology that addresses scalability challenges in the instruction-tuning phase of large language model (LLM) training. LAB reduces reliance on expensive human annotations and on proprietary models such as GPT-4 by combining a taxonomy-guided synthetic data generation process with a multi-phase tuning framework. Models trained with LAB achieve competitive performance on several benchmarks against models trained with traditional human-annotated or GPT-4-generated synthetic data, offering a scalable, cost-effective way to enhance LLM capabilities and instruction-following behavior without the drawbacks of catastrophic forgetting.

LLMs are typically trained in phases: self-supervised pre-training followed by supervised alignment tuning. Most of the cost of training an LLM comes from pre-training, which requires vast amounts of unlabeled data and extensive computational resources. Instruction tuning and preference tuning, which align models with human preferences, account for only a small fraction of the overall data and compute, yet they are critical to the model's final behavior. The bottleneck at this stage is different: high-quality, human-generated, task-specific instruction data is costly to procure and often kept proprietary by model builders. LAB is designed to remove this bottleneck.
The LAB method consists of two components: (i) a taxonomy-guided synthetic data generation method with a quality-assurance process that yields a highly diverse, high-quality instruction dataset without using proprietary LLMs or substantial human curation, and (ii) a multi-phase training framework with an unconventional tuning regime that adds new knowledge and instruction-following abilities to pre-trained LLMs without catastrophic forgetting.

The taxonomy serves both to organize data curation and to guide the synthetic data generator. It hierarchically classifies data samples into small task groups under three main branches: knowledge, foundational skills, and compositional skills. Each branch splits into progressively more granular levels; tasks are defined at the leaf nodes and exemplified by manually written instruction-response pairs. Synthetic data generation is guided by the taxonomy to ensure targeted coverage of the support of the teacher model's distribution around each individual leaf node. This keeps the generated data diverse and of high quality, which in turn improves the performance of the student models trained on it.

LAB introduces two synthetic data generation methods: skill generation and knowledge generation. Skill generation uses the handful of task examples at each leaf node as seeds to synthesize many more examples with the open-source Mixtral-8x7B model. Knowledge generation also uses Mixtral-8x7B, but does not rely on the knowledge stored in the teacher model.
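The taxonomy walk described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' implementation: `TaxonomyNode`, `generate_synthetic_data`, and the `teacher.generate` interface are hypothetical names, and the real pipeline adds prompting templates and quality-assurance filtering not shown here.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the LAB taxonomy: a tree whose leaf nodes hold a
# handful of manually written seed instruction-response pairs.
@dataclass
class TaxonomyNode:
    name: str
    children: list["TaxonomyNode"] = field(default_factory=list)
    seed_examples: list[dict] = field(default_factory=list)  # non-empty only at leaves

    def leaves(self):
        """Yield every leaf node under this subtree."""
        if not self.children:
            yield self
        for child in self.children:
            yield from child.leaves()

def generate_synthetic_data(root: TaxonomyNode, teacher, n_per_leaf: int = 100) -> list:
    """Walk the taxonomy and prompt the teacher model (e.g. Mixtral-8x7B)
    with each leaf's seed examples to synthesize new instruction data.
    The `teacher.generate` signature is an assumption for illustration."""
    dataset = []
    for leaf in root.leaves():
        for _ in range(n_per_leaf):
            sample = teacher.generate(few_shot=leaf.seed_examples, task=leaf.name)
            dataset.append(sample)
    return dataset
```

Because generation is driven leaf by leaf, coverage of the taxonomy is targeted by construction rather than left to chance, which is the point of taxonomy guidance over free-form self-instruct sampling.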
LAB training happens in two phases: knowledge tuning followed by skills tuning, with replay buffers carrying samples from earlier phases forward to keep training stable and prevent catastrophic forgetting.
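The two-phase tuning regime with replay buffers can be sketched as follows. This is a minimal illustration under assumed names: `phased_tuning`, the `model.train_on` interface, and the replay fraction are all hypothetical, not details taken from the paper.

```python
import random

def phased_tuning(model, knowledge_data: list, skills_data: list,
                  replay_fraction: float = 0.1):
    """Sketch of LAB's two-phase tuning: knowledge tuning first, then skills
    tuning with a replay buffer of phase-1 samples mixed in to guard against
    catastrophic forgetting. All names here are illustrative assumptions."""
    # Phase 1: knowledge tuning on the knowledge branch of the taxonomy.
    model.train_on(knowledge_data)

    # Phase 2: skills tuning, replaying a slice of phase-1 data so the
    # model does not overwrite what it just learned.
    replay = random.sample(knowledge_data,
                           int(len(knowledge_data) * replay_fraction))
    model.train_on(skills_data + replay)
    return model
```

The design choice is the replay mix: each later phase sees a sample of earlier-phase data alongside its own, trading a small amount of extra compute for stability across phases.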