Instruction Pre-Training: Language Models are Supervised Multitask Learners

20 Jun 2024 | Daixuan Cheng, Yuxian Gu, Shaohan Huang, Junyu Bi, Minlie Huang, Furu Wei
Instruction Pre-Training is a framework for supervised multitask pre-training of language models (LMs): instead of training on raw corpora alone, it augments the corpora with instruction-response pairs generated by an instruction synthesizer. The synthesizer, built by multitask fine-tuning a language model on a diverse collection of datasets, generates varied instruction-response pairs from raw texts with high knowledge coverage and correctness, and is more cost-effective than relying on large or closed-source models for data synthesis. Models trained on the augmented corpora make stronger pre-trained bases and benefit more from subsequent instruction tuning.

The approach improves both general pre-training from scratch and domain-adaptive continual pre-training. In general pre-training, a 500M-parameter model pre-trained on 100B tokens achieves performance comparable to a 1B-parameter model pre-trained on 300B tokens. In continual pre-training, Instruction Pre-Training enables Llama3-8B to match or even outperform Llama3-70B. Experiments show significant gains on various benchmarks, including MMLU, and the method outperforms vanilla pre-training on most domain-specific tasks.

Analyses of the instruction-augmented corpora show high context relevance, response accuracy, and task diversity. The method is complementary to other work on synthetic instruction generation and offers a promising route to supervised multitask pre-training, while the study also highlights the importance of data curation and the limitations of synthetic data, such as the risk of hallucinations and the need for post-verification techniques. The model, code, and data are available at https://github.com/microsoft/LMOps.
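To make the synthesizer's training data concrete, the sketch below shows one way context-grounded datasets (e.g., reading comprehension) could be converted into fine-tuning examples that teach the synthesizer to map raw text to instruction-response pairs. The <CON>/<QUE>/<ANS> tags, field names, and layout are illustrative assumptions, not the paper's exact format.

```python
# Hedged sketch: building fine-tuning examples for an instruction synthesizer
# from an existing context-grounded dataset. Tag strings and field names are
# placeholders, not the released data format.

def to_synthesizer_example(context: str, qa_pairs: list[tuple[str, str]]) -> dict:
    """Map (context, question-answer pairs) to an input/target pair so the
    synthesizer learns: raw text in, instruction-response pairs out."""
    target = "\n".join(f"<QUE> {q} <ANS> {a}" for q, a in qa_pairs)
    return {"input": f"<CON> {context}", "target": target}

example = to_synthesizer_example(
    "Water boils at 100 degrees Celsius at sea level.",
    [("At what temperature does water boil at sea level?", "100 degrees Celsius.")],
)
print(example["input"])
print(example["target"])
```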
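Once a synthesizer is available, instruction augmentation of a corpus reduces to generating pairs for each raw text and concatenating text and pairs into pre-training examples. The following is a minimal sketch assuming the synthesizer is a Hugging Face text-generation checkpoint; the checkpoint name, decoding settings, and output parsing are placeholders rather than the authors' released pipeline.

```python
# Minimal sketch of the Instruction Pre-Training data flow (not the authors' code).
# Assumption: a causal-LM synthesizer fine-tuned to emit instruction-response
# pairs when given a raw text as the prompt.
from transformers import pipeline

SYNTHESIZER = "your-org/instruction-synthesizer"  # hypothetical checkpoint name


def augment_corpus(raw_texts, max_new_tokens=256):
    """Turn each raw text into an instruction-augmented pre-training example:
    the original text followed by synthesized instruction-response pairs."""
    synthesize = pipeline("text-generation", model=SYNTHESIZER)
    augmented = []
    for text in raw_texts:
        out = synthesize(text, max_new_tokens=max_new_tokens, do_sample=False)
        pairs = out[0]["generated_text"][len(text):]  # keep only the newly generated tokens
        # Concatenate the raw text with its instruction-response pairs; these
        # examples are then mixed with plain corpus text for (continual) pre-training.
        augmented.append(text + "\n" + pairs.strip())
    return augmented


if __name__ == "__main__":
    corpus = ["Photosynthesis converts light energy into chemical energy stored in glucose."]
    for example in augment_corpus(corpus):
        print(example)
```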