29 May 2024 | Chenxin An, Fei Huang, Jun Zhang, Shansan Gong, Xipeng Qiu, Chang Zhou, Lingpeng Kong
The paper introduces Dual Chunk Attention (DCA), a novel training-free framework that enables large language models (LLMs) to process and generate coherent text with context windows exceeding their pretraining length. DCA decomposes attention computation into chunk-based modules, capturing relative positional information within and across chunks, and integrates seamlessly with Flash Attention. This approach allows LLMs like Llama2 70B to support context windows of over 100k tokens without additional training. The method is evaluated on various tasks, including language modeling, passkey retrieval, and real-world applications, demonstrating significant extrapolation capabilities and performance comparable to or better than finetuned models. The paper also provides a comprehensive evaluation of DCA's effectiveness, showing its orthogonality to existing scaled positional encodings and its ability to maintain long-range dependencies. The results highlight the potential of DCA as a viable open-source alternative for handling long-context scenarios in LLM applications.
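The abstract describes the mechanism only at a high level, so a minimal sketch of the chunked position re-indexing idea may help: relative positions within a chunk are computed as usual, attention to the neighbouring chunk keeps the true distance to preserve locality, and attention to more distant chunks is clamped to a large but still in-window distance. The function name, its parameters (`chunk`, `window`, `local`), and the exact index assignments below are illustrative assumptions, not the authors' released implementation.

```python
def dca_relative_position(i: int, j: int, chunk: int, window: int, local: int) -> int:
    """Relative position (query index minus key index) under a chunked
    re-indexing scheme in the spirit of DCA.

    `chunk` is the chunk size, `window` the pretraining context length, and
    `local` the span over which exact locality is preserved (assumed to satisfy
    local <= chunk <= window). Names and index assignments are a sketch, not
    the paper's code.
    """
    qi, ki = i % chunk, j % chunk          # offsets within their own chunks
    if i // chunk == j // chunk:           # intra-chunk: ordinary relative position
        return qi - ki
    if i - j < local:                      # neighbouring chunk: keep the true distance
        return (qi + chunk) - ki           # equals i - j when j lies in the previous chunk
    return (window - 1) - ki               # distant chunk: clamp to a large but valid distance


if __name__ == "__main__":
    chunk, window, local = 4096, 8192, 4096
    # Every query-key distance the model sees stays inside [0, window),
    # even for sequences far beyond the pretraining length.
    assert all(
        0 <= dca_relative_position(i, j, chunk, window, local) < window
        for i in (0, 5_000, 100_000)
        for j in range(0, i + 1, 2_500)
    )
```

Because every relative position stays within the pretraining window, the frozen model never sees an out-of-distribution distance, which is what lets the scheme extend the context without any finetuning.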