26 Jun 2024 | Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, Li Yuan
LOOK-M is a pioneering, fine-tuning-free approach for efficiently reducing the multimodal Key-Value (KV) cache in long-context Multimodal Large Language Models (MLLMs). Unlike traditional optimizations for single-modality LLMs, which manage purely textual contexts, MLLM contexts include representations of multiple images with temporal and spatial relationships, making those optimizations unsuitable. LOOK-M introduces a text-prior method that prioritizes retaining textual KV pairs during the prompt encoding phase, based on the observation that the model attends more strongly to text than to image features. Combined with various merging strategies, this effectively compresses the KV cache while maintaining or even enhancing performance across a variety of long-context multimodal tasks. LOOK-M reduces KV cache memory usage by up to 80% and achieves up to 1.5x faster decoding while preserving or improving performance, requires no fine-tuning, and can be applied plug-and-play, making it highly adaptable to different multimodal datasets and architectures.
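To make the mechanism concrete, below is a minimal PyTorch sketch of the idea described above: textual KV pairs are always retained, only the most-attended image KV pairs are kept within a budget, and the evicted image KV pairs are merged into their most similar retained neighbors rather than simply dropped. The function name, argument layout, and the nearest-neighbor average-merging rule are illustrative assumptions for this sketch, not the paper's exact implementation.

```python
import torch


def compress_kv_cache(keys, values, attn_scores, is_text, budget_ratio=0.2):
    """Sketch of text-prior KV cache compression for one attention head.

    keys, values : [seq_len, head_dim]  cached K/V from the prompt encoding phase
    attn_scores  : [seq_len]            accumulated attention each position received
    is_text      : [seq_len] bool       True for text tokens, False for image tokens
    budget_ratio : fraction of image KV pairs to retain (hypothetical knob)
    """
    text_idx = torch.nonzero(is_text, as_tuple=True)[0]
    image_idx = torch.nonzero(~is_text, as_tuple=True)[0]
    if image_idx.numel() == 0:
        return keys.clone(), values.clone()

    # Text-prior: keep all textual KV pairs; keep only the most-attended image KV pairs.
    n_keep = max(1, int(budget_ratio * image_idx.numel()))
    keep_order = torch.topk(attn_scores[image_idx], n_keep).indices
    evict_mask = torch.ones(image_idx.numel(), dtype=torch.bool)
    evict_mask[keep_order] = False
    kept_image_idx = image_idx[keep_order]
    evicted_idx = image_idx[evict_mask]

    kept_idx = torch.cat([text_idx, kept_image_idx]).sort().values
    new_keys, new_values = keys[kept_idx].clone(), values[kept_idx].clone()

    # Merge each evicted image KV pair into its most similar retained pair
    # (simple average merging by key cosine similarity -- one possible strategy).
    sim = torch.nn.functional.cosine_similarity(
        keys[evicted_idx].unsqueeze(1), new_keys.unsqueeze(0), dim=-1
    )                                    # [n_evicted, n_kept]
    target = sim.argmax(dim=1)           # retained slot receiving each evicted pair

    counts = torch.ones(new_keys.size(0), dtype=new_keys.dtype, device=new_keys.device)
    for e, t in zip(evicted_idx.tolist(), target.tolist()):
        new_keys[t] += keys[e]
        new_values[t] += values[e]
        counts[t] += 1
    new_keys /= counts.unsqueeze(-1)
    new_values /= counts.unsqueeze(-1)
    return new_keys, new_values


# Toy usage: 8 text tokens followed by 32 image tokens, head_dim = 64.
torch.manual_seed(0)
K, V = torch.randn(40, 64), torch.randn(40, 64)
scores = torch.rand(40)
is_text = torch.arange(40) < 8
k2, v2 = compress_kv_cache(K, V, scores, is_text, budget_ratio=0.25)
print(K.shape, "->", k2.shape)  # torch.Size([40, 64]) -> torch.Size([16, 64])
```

The merging step is the key design choice: instead of plainly evicting low-attention image KV pairs, their information is folded into the retained entries, which is what lets the cache shrink sharply without discarding visual context outright. The paper's merging strategies differ in how the evicted pairs are weighted, but the simple average above captures the general intent.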