Data Engineering for Scaling Language Models to 128K Context

15 Feb 2024 | Yao Fu, Rameswar Panda, Xinyao Niu, Xiang Yue, Hannaneh Hajishirzi, Yoon Kim, Hao Peng
This paper investigates data engineering techniques for scaling language models to 128K-token context lengths, focusing on the ability to use information at arbitrary locations in the input. The authors hypothesize that this capability is largely acquired during large-scale pretraining and can be extended to much longer contexts through lightweight continual pretraining on an appropriate data mixture.

They study how much data is needed and what it should look like, finding that 500 million to 5 billion tokens suffice and that domain balance and length upsampling are both critical. Continual pretraining of full models on 1-5 billion tokens of per-source length-upsampled data, where long sequences are upsampled within each domain while the overall domain mixture is kept fixed, proves to be an effective and affordable strategy. The resulting models outperform strong open-source long-context models and close the gap to frontier models such as GPT-4 128K, particularly on the Needle-in-a-Haystack test, which assesses precise retrieval from long documents.

The paper also highlights the limitations of existing approaches that upsample long sequences without preserving domain balance. The results suggest that long-context continual pretraining can be run as a separate stage after code and math pretraining, and that further research is needed on instruction fine-tuning for tasks requiring 100K-token contexts.
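To make the per-source length-upsampling idea concrete, the sketch below shows one plausible way to implement it: tokens are drawn from each source at its original mixture ratio, while long documents are sampled more often within each source. The function name, the 4096-token threshold, and the upsampling weight are illustrative assumptions, not the paper's exact recipe.

```python
import random
from collections import defaultdict

def per_source_length_upsample(docs, long_threshold=4096, long_weight=5.0, seed=0):
    """Sketch of per-source length upsampling (illustrative settings, not the paper's).

    `docs` is a list of dicts with keys:
      - "source": the domain the document comes from (e.g. "code", "books", "web")
      - "tokens": the tokenized document (a list of token ids)

    Within each source, documents longer than `long_threshold` tokens are drawn
    `long_weight` times more often than short ones, while the share of tokens
    taken from each source matches its share in the original corpus.
    """
    rng = random.Random(seed)

    # Group documents by source and record each source's original token share.
    by_source = defaultdict(list)
    for doc in docs:
        by_source[doc["source"]].append(doc)
    total_tokens = sum(len(d["tokens"]) for d in docs)
    source_share = {
        src: sum(len(d["tokens"]) for d in ds) / total_tokens
        for src, ds in by_source.items()
    }

    def sample_from_source(src):
        # Length-upsampled sampling within a single source.
        ds = by_source[src]
        weights = [long_weight if len(d["tokens"]) > long_threshold else 1.0 for d in ds]
        return rng.choices(ds, weights=weights, k=1)[0]

    def sample_document():
        # Pick the source according to the original domain mixture,
        # then pick a document with length-dependent weights inside it.
        src = rng.choices(list(source_share), weights=list(source_share.values()), k=1)[0]
        return sample_from_source(src)

    return sample_document
```

In use, the returned sampler would repeatedly draw documents that are then concatenated and packed into 128K-token training sequences for continual pretraining, e.g. `sampler = per_source_length_upsample(corpus_docs)` followed by `docs = [sampler() for _ in range(n)]`.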