2024 | Chenxin An, Fei Huang, Jun Zhang, Shansan Gong, Xipeng Qiu, Chang Zhou, Lingpeng Kong
This paper introduces Dual Chunk Attention (DCA), a training-free framework that extends the context window of Large Language Models (LLMs). By decomposing attention over long sequences into chunk-based computations, DCA captures both intra-chunk and inter-chunk relative positional information and integrates seamlessly with Flash Attention. Without any further training, DCA enables Llama2 70B to support context windows of more than 100k tokens and achieves performance comparable to or better than finetuned models on practical long-context tasks; the training-free 70B model reaches 94% of the performance of gpt-3.5-16k, making it a viable open-source alternative to proprietary models. The framework is evaluated on language modeling, passkey retrieval, and real-world long-context applications, demonstrating strong extrapolation capability, orthogonality to existing positional encodings, and improved long-context understanding, while remaining efficient and requiring minimal computational resources. DCA can be applied to various LLMs, including Llama2, Together-32k, and CodeLlama, to extend their context windows while maintaining performance, and it is validated on multiple benchmarks in both training-free and finetuned settings, offering a cost-effective solution for long-context LLM applications.
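To make the chunk-based position handling concrete, below is a minimal NumPy sketch of the general idea: within a chunk, query-key pairs keep their ordinary relative distances; across chunks, the distance is capped so it never exceeds the pretrained window; and a small local window keeps exact distances for nearby tokens. The function name `dca_relative_positions`, the `chunk_size` and `local_window` parameters, and the specific capping rule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def dca_relative_positions(seq_len: int, chunk_size: int, local_window: int) -> np.ndarray:
    """Sketch of chunk-wise relative position indices (hypothetical simplification).

    Intra-chunk pairs reuse ordinary relative distances; inter-chunk pairs get a
    capped distance so no index exceeds the pretrained window, regardless of how
    far apart the chunks are.
    """
    q_idx = np.arange(seq_len)
    k_idx = np.arange(seq_len)
    q_chunk = q_idx // chunk_size
    k_chunk = k_idx // chunk_size

    # Ordinary causal relative distance q - k (only q >= k is meaningful).
    rel = q_idx[:, None] - k_idx[None, :]
    same_chunk = q_chunk[:, None] == k_chunk[None, :]

    # Intra-chunk: true relative distance, always smaller than chunk_size.
    intra = (q_idx[:, None] % chunk_size) - (k_idx[None, :] % chunk_size)

    # Inter-chunk: treat the query as if it sat at the last in-chunk position,
    # so the relative distance stays within [0, chunk_size).
    inter = (chunk_size - 1) - (k_idx[None, :] % chunk_size)

    pos = np.where(same_chunk, intra, inter)

    # Local refinement (sketch): keep exact distances for keys within a small
    # window of the query, preserving locality across chunk boundaries.
    local = (rel >= 0) & (rel < local_window)
    return np.where(local, rel, pos)

# Example: 16 tokens, chunks of 4, local window of 2 -- all indices stay < 4.
print(dca_relative_positions(16, chunk_size=4, local_window=2).max())
```

Under these assumptions, every relative position fed to the rotary embedding remains inside the range seen during pretraining, which is the property that lets the model extrapolate to 100k+ token inputs without retraining.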