26 Jun 2024 | Zhongwei Wan, Ziang Wu, Che Liu, Jinfu Huang, Zhihong Zhu, Peng Jin, Longyue Wang, Li Yuan
LOOK-M is a novel approach to efficient multimodal long-context inference that reduces the size of the key-value (KV) cache while maintaining performance comparable to a full cache. Unlike prior methods that focus on text-centric KV cache optimization, LOOK-M is designed specifically for multimodal long-context scenarios. It introduces a text-prior strategy that compresses the KV cache by prioritizing the retention of textual KV pairs during the prompt encoding phase, and it adds several compensatory strategies based on KV pair merging to mitigate the loss of image contextual information. LOOK-M achieves up to 1.5x faster decoding and reduces KV cache memory usage by up to 95%, while maintaining or even improving performance across a range of long-context multimodal tasks. The method requires no fine-tuning and can be applied in a plug-and-play manner.

Experimental results show that LOOK-M outperforms existing baselines in both performance and efficiency, demonstrating that multimodal KV caches can be compressed without significant loss of information. Evaluations on several recent MLLM backbones across diverse multimodal long-context tasks show consistent improvements. The approach leverages attention-map interactions between text and images to guide KV cache pruning so that critical information is preserved, and it explores several merging strategies, including averaged, pivotal, and weighted merging, to further strengthen KV cache compression. Overall, LOOK-M offers a promising solution for efficient multimodal long-context inference by substantially reducing KV cache size while maintaining high performance.
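To make the text-prior pruning step concrete, here is a minimal sketch, not the authors' implementation. The tensor shapes, the `is_text` mask, the `keep_ratio` budget, and the function name `text_prior_prune` are assumptions for illustration: all textual KV pairs are retained, and image KV pairs are ranked by the attention they accumulate during prompt encoding, with only the most-attended ones kept.

```python
import torch

def text_prior_prune(keys, values, attn, is_text, keep_ratio=0.2):
    """Illustrative text-prior KV cache pruning (assumed shapes).

    keys, values : (num_heads, seq_len, head_dim) cached during prompt encoding
    attn         : (num_heads, seq_len, seq_len) prompt-phase attention map
    is_text      : (seq_len,) bool mask marking text-token positions
    """
    _, seq_len, _ = keys.shape

    # Importance of each cached position = attention it receives,
    # summed over query positions and averaged over heads.
    importance = attn.sum(dim=1).mean(dim=0)              # (seq_len,)

    # Cache budget; text KV pairs are always retained (text prior).
    budget = max(int(seq_len * keep_ratio), int(is_text.sum()))
    num_image_keep = min(budget - int(is_text.sum()), int((~is_text).sum()))

    keep = is_text.clone()

    # Among image positions, keep the most-attended ones.
    image_scores = importance.masked_fill(is_text, float("-inf"))
    if num_image_keep > 0:
        keep[image_scores.topk(num_image_keep).indices] = True

    kept_idx = keep.nonzero(as_tuple=True)[0]
    evicted_idx = (~keep).nonzero(as_tuple=True)[0]
    return keys[:, kept_idx], values[:, kept_idx], kept_idx, evicted_idx
```

The evicted indices are returned so that the pruned pairs can be folded back into the cache by the merging step rather than discarded outright.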
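The compensatory merging can be sketched in the same spirit. This is an illustrative variant, not the paper's exact formulas: each evicted KV pair is assigned to its most similar retained pair by key cosine similarity, and the averaged, pivotal, and weighted strategies below are rough readings of those three options (the pivot weight of 0.6 is an assumption).

```python
import torch
import torch.nn.functional as F

def merge_evicted(kept_k, kept_v, evicted_k, evicted_v, strategy="weighted"):
    """Fold evicted KV pairs into their nearest retained pairs (assumed shapes).

    kept_k, kept_v       : (num_heads, n_keep, head_dim)
    evicted_k, evicted_v : (num_heads, n_evict, head_dim)
    """
    # Cosine similarity between evicted and kept keys, per head.
    sim = torch.einsum(
        "hed,hkd->hek",
        F.normalize(evicted_k, dim=-1),
        F.normalize(kept_k, dim=-1),
    )                                         # (heads, n_evict, n_keep)
    nearest = sim.argmax(dim=-1)              # index of the closest kept pair

    merged_k, merged_v = kept_k.clone(), kept_v.clone()
    num_heads, n_keep, _ = kept_k.shape
    for h in range(num_heads):
        for t in range(n_keep):
            group = (nearest[h] == t).nonzero(as_tuple=True)[0]
            if group.numel() == 0:
                continue
            gk, gv = evicted_k[h, group], evicted_v[h, group]
            if strategy == "averaged":
                # Uniform average over the kept pair and its assigned evicted pairs.
                merged_k[h, t] = torch.cat([kept_k[h, t:t + 1], gk]).mean(dim=0)
                merged_v[h, t] = torch.cat([kept_v[h, t:t + 1], gv]).mean(dim=0)
            elif strategy == "pivotal":
                # One reading of "pivotal": the retained pair keeps a dominant,
                # fixed weight and the evicted group fills in the remainder.
                merged_k[h, t] = 0.6 * kept_k[h, t] + 0.4 * gk.mean(dim=0)
                merged_v[h, t] = 0.6 * kept_v[h, t] + 0.4 * gv.mean(dim=0)
            elif strategy == "weighted":
                # Similarity-weighted average of the evicted group,
                # then combined with the kept pair.
                w = sim[h, group, t].clamp(min=0).softmax(dim=0)
                merged_k[h, t] = 0.5 * kept_k[h, t] + 0.5 * (w[:, None] * gk).sum(dim=0)
                merged_v[h, t] = 0.5 * kept_v[h, t] + 0.5 * (w[:, None] * gv).sum(dim=0)
    return merged_k, merged_v
```

In this sketch the merging preserves a summary of the pruned image context inside the retained cache entries, which is the role the compensatory strategies play in keeping accuracy close to the full-cache baseline.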