15 Feb 2024 | Yao Fu, Rameswar Panda, Xinyao Niu, Xiang Yue, Hannaneh Hajishirzi, Yoon Kim, Hao Peng
This paper presents a data engineering approach to scaling language models to 128K context length. The authors hypothesize that the ability to retrieve information from arbitrary locations in long contexts is already acquired through large-scale pretraining, and that lightweight continual pretraining on an appropriate data mixture can extend this capability to much longer contexts. Investigating the quantity and quality of data needed for continual pretraining, they find that 500 million to 5 billion tokens are sufficient to enable the model to retrieve information anywhere within the 128K context. They emphasize the importance of domain balance and length upsampling: naively upsampling longer data from certain domains, such as books, leads to suboptimal performance, whereas a balanced domain mixture is crucial.

The authors demonstrate that continual pretraining of the full model on 1B-5B tokens of such data is an effective and affordable strategy for scaling the context length of language models to 128K. Their approach outperforms strong open-source long-context models and closes the gap to frontier models like GPT-4 128K. The key data recipe is per-source length upsampling, which increases the share of long sequences within each domain while retaining the overall domain mixture; this recipe provides the most balanced performance gain.
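For concreteness, the following is a minimal sketch of what per-source length upsampling can look like in practice, assuming documents are stored as dicts with `domain` and `num_tokens` fields; the length threshold, upsampling weight, and per-domain token budgets are illustrative placeholders, not the paper's exact recipe.

```python
import random
from collections import defaultdict

# Illustrative settings (not the paper's exact values).
LONG_THRESHOLD = 32_000   # documents longer than this count as "long"
LONG_WEIGHT = 5.0         # how much more often long documents are sampled
TOKENS_PER_DOMAIN = {     # target token budget per domain (keeps the original mixture)
    "web": 400_000_000,
    "books": 50_000_000,
    "code": 50_000_000,
}

def per_source_length_upsample(docs, rng=random.Random(0)):
    """Sample a continual-pretraining mixture that keeps each domain's
    token share fixed while oversampling long documents inside each domain."""
    by_domain = defaultdict(list)
    for doc in docs:
        by_domain[doc["domain"]].append(doc)

    mixture = []
    for domain, budget in TOKENS_PER_DOMAIN.items():
        pool = by_domain[domain]
        if not pool:
            continue
        # Long documents get a higher sampling weight; short ones keep weight 1.
        weights = [LONG_WEIGHT if d["num_tokens"] > LONG_THRESHOLD else 1.0 for d in pool]
        tokens = 0
        while tokens < budget:
            doc = rng.choices(pool, weights=weights, k=1)[0]
            mixture.append(doc)
            tokens += doc["num_tokens"]
    return mixture
```

The point of the sketch is the separation of concerns: the domain budgets control the mixture, while the per-domain sampling weights control how much long data each domain contributes.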
The authors also show that upsampling long sequences while retaining the domain mixture is crucial for context scaling, and that relying on validation loss alone as an evaluation metric can obscure underlying differences in retrieval capability. Their method achieves strong performance on the Needle-in-a-Haystack test and other long-context benchmarks, demonstrating the effectiveness of the data engineering approach. The paper also discusses the infrastructure and engineering required for training on long contexts, showing that it is feasible with academic-level resources. The authors conclude that their approach provides a foundation for future research on long-context instruction finetuning.
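As a rough illustration of the kind of retrieval probe the Needle-in-a-Haystack test performs, here is a minimal sketch assuming a generic `generate(prompt) -> str` function for the model under test; the needle sentence, filler text, and context-length/depth grid are illustrative placeholders, not the paper's exact setup.

```python
# Hypothetical needle and question; any fact absent from the filler works.
NEEDLE = "The best thing to do in San Francisco is eat a sandwich in Dolores Park."
QUESTION = "What is the best thing to do in San Francisco?"
FILLER = "The grass is green. The sky is blue. " * 10_000  # distractor text

def build_prompt(context_len_chars: int, depth: float) -> str:
    """Place the needle at a relative depth (0.0 = start, 1.0 = end) of the context."""
    haystack = FILLER[:context_len_chars]
    pos = int(len(haystack) * depth)
    context = haystack[:pos] + " " + NEEDLE + " " + haystack[pos:]
    return f"{context}\n\nQuestion: {QUESTION}\nAnswer:"

def run_probe(generate, context_lens=(8_000, 64_000, 256_000),
              depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Return a {(context_len, depth): hit} grid, scoring a hit
    if the model's answer mentions the needle's key phrase."""
    results = {}
    for n in context_lens:
        for d in depths:
            answer = generate(build_prompt(n, d))
            results[(n, d)] = "Dolores Park" in answer
    return results
```

Sweeping over context lengths and insertion depths is what produces the familiar retrieval heatmap, and it is exactly this grid that validation loss alone cannot reveal.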