25 Jan 2025 | Zhihang Lin, Mingbao Lin, Luxi Lin, Rongrong Ji
This paper proposes a method called Visual Tokens Withdrawal (VTW) to accelerate the inference of multimodal large language models (MLLMs). The key idea is to remove vision tokens at a specific layer of the MLLM, since they become unnecessary in deep layers once their visual information has migrated into subsequent text tokens. The method is motivated by two observations: (1) the attention sink phenomenon, where in deep layers most attention concentrates on the initial and most recent tokens while the vision tokens in the middle of the sequence receive minimal attention; and (2) information migration, where visual information is transferred into subsequent text tokens within the first few layers of the MLLM.
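The withdrawal itself is mechanically simple: past a chosen layer K, the vision-token hidden states are dropped from the sequence, so every deeper layer attends over text tokens only. A minimal NumPy sketch of the idea (all names are hypothetical; real MLLM layers are transformer blocks, stubbed here as plain callables):

```python
import numpy as np

def forward_with_vtw(hidden, layers, vision_slice, withdraw_layer):
    """Run a stack of layers, withdrawing vision tokens at withdraw_layer.

    hidden: (seq_len, dim) array of token hidden states
    layers: list of callables, each mapping (n, dim) -> (n, dim)
    vision_slice: slice covering the vision-token positions
    withdraw_layer: layer index K at which vision tokens are removed
    """
    for i, layer in enumerate(layers):
        if i == withdraw_layer:
            # Withdraw vision tokens: only text tokens remain from here on,
            # shrinking attention/FFN cost in every deeper layer.
            keep = np.ones(hidden.shape[0], dtype=bool)
            keep[vision_slice] = False
            hidden = hidden[keep]
        hidden = layer(hidden)
    return hidden

# Toy usage: 1 text token, 6 vision tokens, 3 text tokens; trivial "layers".
rng = np.random.default_rng(0)
h = rng.standard_normal((10, 8))
layers = [lambda x: x + 1.0 for _ in range(6)]
out = forward_with_vtw(h, layers, vision_slice=slice(1, 7), withdraw_layer=3)
print(out.shape)  # (4, 8): vision tokens were dropped before the deep layers
```

Because the deep layers simply see a shorter sequence, this is compatible with standard KV-cache and FlashAttention implementations, unlike per-token pruning schemes that need custom attention masks.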
To determine the withdrawal layer, the authors run the model on a small calibration set and select the earliest layer at which withdrawing vision tokens leaves the output distribution essentially unchanged, measured by a Kullback-Leibler divergence criterion. VTW reduces computational overhead by over 40% across various multimodal tasks while maintaining performance, and it is compatible with the KV cache and FlashAttention, making it suitable for real-time applications.
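The layer-selection criterion can be sketched as follows: for each candidate layer K, compare the model's output distribution with and without vision tokens withdrawn at K, and take the first K whose KL divergence falls below a threshold. This is a simplified illustration, not the authors' implementation; the function names, the threshold value, and the deterministic toy logits are all invented for the example:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_div(p, q, eps=1e-12):
    # KL(p || q), summed over the vocabulary axis, averaged over positions.
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

def select_withdrawal_layer(logits_full, logits_withdrawn, threshold=1e-2):
    """Return the first layer index K at which withdrawing vision tokens
    barely changes the output distribution.

    logits_full: per-candidate-layer logits from the unmodified model
    logits_withdrawn: matching logits when vision tokens are withdrawn at K
    """
    for k, (lf, lw) in enumerate(zip(logits_full, logits_withdrawn)):
        if kl_div(softmax(lf), softmax(lw)) < threshold:
            return k
    return len(logits_full)  # no layer is safe: never withdraw

# Deterministic toy data: withdrawing at layers 0-2 shifts the output
# distribution a lot; from layer 3 on it matches the full model exactly.
full = [np.zeros(32) for _ in range(5)]
withdrawn = []
for k in range(5):
    v = np.zeros(32)
    if k < 3:
        v[0] = 5.0  # early withdrawal distorts the prediction
    withdrawn.append(v)

k_star = select_withdrawal_layer(full, withdrawn)
print(k_star)  # 3
```

In the paper's setting the two logit sets would come from running the MLLM on a handful of calibration examples, which is why only a tiny amount of data is needed to pick the layer.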
The authors conduct extensive experiments on various multimodal tasks, including visual question answering, hallucination evaluation, visual reasoning, and video understanding, demonstrating that VTW significantly reduces FLOPs without compromising performance. The method also applies to multimodal chatbots, where it accelerates inference while producing answers with imperceptible differences from the unmodified model.
The paper also discusses the limitations of existing methods for reducing computational costs in MLLMs: lack of flexibility, incomplete importance metrics, and incompatibility with the KV cache and FlashAttention. The authors argue that VTW provides a more comprehensive solution by removing vision tokens at a specific layer, ensuring flexibility across tasks and compatibility with existing inference mechanisms.
The results show that VTW outperforms existing methods in terms of computational efficiency and maintains high performance across different tasks. The method is applicable to various MLLMs, including LLaVA, LLaVA-NeXT, and Video-LLaVA, and has been tested on a wide range of downstream tasks, including segmentation and reasoning. The paper concludes that VTW is an effective method for accelerating MLLMs while maintaining performance, making it a valuable tool for real-time applications.