Instruction Pre-Training: Language Models are Supervised Multitask Learners

20 Jun 2024 | Daixuan Cheng, Yuxian Gu, Shaohan Huang, Junyu Bi, Minlie Huang, Furu Wei
Instruction Pre-Training is a framework for supervised multitask pre-training of language models (LMs): instead of training on raw corpora alone, it augments the corpora with instruction-response pairs generated by an instruction synthesizer. The synthesizer, built by multitask fine-tuning a language model on a diverse collection of datasets, generates varied instruction-response pairs from raw texts with high knowledge coverage and correctness, and is more cost-effective than relying on large or closed-source models for data synthesis. Models trained on the augmented corpora make stronger pre-trained bases and benefit more from subsequent instruction tuning.

The approach improves both general pre-training from scratch and domain-adaptive continual pre-training. In general pre-training, a 500M-parameter model pre-trained on 100B tokens achieves performance comparable to a 1B-parameter model pre-trained on 300B tokens. In continual pre-training, Instruction Pre-Training enables Llama3-8B to match or even outperform Llama3-70B. Experiments show significant gains on various benchmarks, including MMLU, and the method outperforms vanilla pre-training on most domain-specific tasks.

Analyses of the instruction-augmented corpora show high context relevance, response accuracy, and task diversity. The method is complementary to other work on synthetic instruction generation and offers a promising route to supervised multitask pre-training, while the study also highlights the importance of data curation and the limitations of synthetic data, such as the risk of hallucinations and the need for post-verification techniques. The model, code, and data are available at https://github.com/microsoft/LMOps.
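To make the synthesizer's training data concrete, the sketch below shows one way context-grounded datasets (e.g., reading comprehension) could be converted into fine-tuning examples that teach the synthesizer to map raw text to instruction-response pairs. The <CON>/<QUE>/<ANS> tags, field names, and layout are illustrative assumptions, not the paper's exact format.

```python
# Hedged sketch: building fine-tuning examples for an instruction synthesizer
# from an existing context-grounded dataset. Tag strings and field names are
# placeholders, not the released data format.

def to_synthesizer_example(context: str, qa_pairs: list[tuple[str, str]]) -> dict:
    """Map (context, question-answer pairs) to an input/target pair so the
    synthesizer learns: raw text in, instruction-response pairs out."""
    target = "\n".join(f"<QUE> {q} <ANS> {a}" for q, a in qa_pairs)
    return {"input": f"<CON> {context}", "target": target}

example = to_synthesizer_example(
    "Water boils at 100 degrees Celsius at sea level.",
    [("At what temperature does water boil at sea level?", "100 degrees Celsius.")],
)
print(example["input"])
print(example["target"])
```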
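Once a synthesizer is available, instruction augmentation of a corpus reduces to generating pairs for each raw text and concatenating text and pairs into pre-training examples. The following is a minimal sketch assuming the synthesizer is a Hugging Face text-generation checkpoint; the checkpoint name, decoding settings, and output parsing are placeholders rather than the authors' released pipeline.

```python
# Minimal sketch of the Instruction Pre-Training data flow (not the authors' code).
# Assumption: a causal-LM synthesizer fine-tuned to emit instruction-response
# pairs when given a raw text as the prompt.
from transformers import pipeline

SYNTHESIZER = "your-org/instruction-synthesizer"  # hypothetical checkpoint name


def augment_corpus(raw_texts, max_new_tokens=256):
    """Turn each raw text into an instruction-augmented pre-training example:
    the original text followed by synthesized instruction-response pairs."""
    synthesize = pipeline("text-generation", model=SYNTHESIZER)
    augmented = []
    for text in raw_texts:
        out = synthesize(text, max_new_tokens=max_new_tokens, do_sample=False)
        pairs = out[0]["generated_text"][len(text):]  # keep only the newly generated tokens
        # Concatenate the raw text with its instruction-response pairs; these
        # examples are then mixed with plain corpus text for (continual) pre-training.
        augmented.append(text + "\n" + pairs.strip())
    return augmented


if __name__ == "__main__":
    corpus = ["Photosynthesis converts light energy into chemical energy stored in glucose."]
    for example in augment_corpus(corpus):
        print(example)
```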