28 May 2024 | Chaojun Xiao, Pengle Zhang, Xu Han, Guangxuan Xiao, Yankai Lin, Zhengyan Zhang, Zhiyuan Liu, Maosong Sun
InfLLM is a training-free method for long-context extrapolation in large language models (LLMs). It lets LLMs process extremely long sequences without additional training by attaching an efficient context memory: distant context is stored in extra memory units, and an efficient lookup mechanism retrieves the units relevant to the current tokens for attention computation. This allows an LLM with a limited context window to process long sequences efficiently while still capturing long-distance dependencies. Even when the sequence length is scaled to 1,024K tokens, InfLLM matches competitive baselines that are continually trained on long sequences.

At the core of the method is a block-level context memory, which organizes past key-value vectors into blocks and selects the semantically most significant tokens of each block as unit representatives for the subsequent relevance computation. This design makes the lookup both effective and efficient, reducing computational cost and memory usage (see the sketch after this summary).

InfLLM is evaluated on two benchmarks, $ \infty $-Bench and LongBench, demonstrating its effectiveness on long sequences. The results show that InfLLM achieves performance comparable to models that have undergone continual training on long sequences, at significantly lower computational and memory cost, and that it generalizes better than retrieval-augmented generation (RAG).

Further evaluation on sequences of varying lengths shows that InfLLM accurately captures long-distance dependencies for effective long-sequence reasoning, and that it can process sequences of up to 1,024K tokens on a single GPU, outperforming the other methods tested. Overall, the results indicate that InfLLM is a practical and efficient approach to improving the length generalizability of LLMs without any additional training.
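To make the block-level lookup concrete, here is a minimal, single-head sketch in PyTorch. It is an illustration under simplifying assumptions, not the authors' implementation: the block size and number of representatives are arbitrary, representative keys are chosen by key norm rather than the paper's significance scores, block relevance is scored against the mean of the current queries, and positional encodings, multi-head attention, and the initial-token segment are omitted. The function names (`build_block_memory`, `lookup_blocks`, `attend_with_memory`) are hypothetical.

```python
# Minimal single-head sketch of block-level context-memory lookup,
# in the spirit of InfLLM. Illustrative only; see the caveats above.
import torch


def build_block_memory(past_k, past_v, block_size=128, n_repr=4):
    """Split cached keys/values into fixed-size blocks and keep a few
    representative keys per block for cheap relevance scoring.
    (Here representatives are the largest-norm keys, a stand-in for the
    paper's significance-based selection.)"""
    blocks = []
    for start in range(0, past_k.size(0), block_size):
        k_blk = past_k[start:start + block_size]          # (B, d)
        v_blk = past_v[start:start + block_size]
        repr_idx = k_blk.norm(dim=-1).topk(min(n_repr, k_blk.size(0))).indices
        blocks.append({"k": k_blk, "v": v_blk, "repr_k": k_blk[repr_idx]})
    return blocks


def lookup_blocks(query, blocks, top_k=2):
    """Score each block by the similarity between the mean current query and
    the block's representative keys; return the KV of the top-k blocks."""
    q_mean = query.mean(dim=0)                            # (d,)
    scores = torch.stack([(blk["repr_k"] @ q_mean).mean() for blk in blocks])
    chosen = scores.topk(min(top_k, len(blocks))).indices.tolist()
    k_sel = torch.cat([blocks[i]["k"] for i in chosen])
    v_sel = torch.cat([blocks[i]["v"] for i in chosen])
    return k_sel, v_sel


def attend_with_memory(query, local_k, local_v, blocks, top_k=2):
    """Scaled dot-product attention over the local window plus the retrieved
    memory blocks; blocks that were not selected are never touched."""
    mem_k, mem_v = lookup_blocks(query, blocks, top_k)
    k = torch.cat([mem_k, local_k])
    v = torch.cat([mem_v, local_v])
    attn = torch.softmax(query @ k.T / k.size(-1) ** 0.5, dim=-1)
    return attn @ v


if __name__ == "__main__":
    d = 64
    past_k, past_v = torch.randn(4096, d), torch.randn(4096, d)   # distant context
    local_k, local_v = torch.randn(256, d), torch.randn(256, d)   # sliding window
    query = torch.randn(16, d)                                    # current chunk
    memory = build_block_memory(past_k, past_v)
    out = attend_with_memory(query, local_k, local_v, memory)
    print(out.shape)   # torch.Size([16, 64])
```

The efficiency argument is visible in `lookup_blocks`: only a handful of representative keys per block are compared against the query, so the cost of deciding which distant context to load grows with the number of blocks rather than the number of cached tokens, and unselected blocks never enter the attention computation at all.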