25 Jan 2025 | Zhihang Lin, Mingbao Lin, Luxi Lin, Rongrong Ji
This paper proposes a method called Visual Tokens Withdrawal (VTW) to accelerate the inference of multimodal large language models (MLLMs). The key idea is to remove vision tokens at a specific layer of the MLLM, since they become unnecessary in deep layers once their visual information has migrated into subsequent text tokens. The method is motivated by two observations: (1) the attention sink phenomenon, where in deep layers most attention concentrates on the initial and most recent tokens while the vision tokens in the middle of the sequence receive minimal attention; and (2) information migration, where visual information is transferred into subsequent text tokens within the first few layers of the MLLM.
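The withdrawal itself is mechanically simple: past a chosen layer K, the vision-token hidden states are dropped from the sequence, so every deeper layer attends over text tokens only. A minimal NumPy sketch of the idea (all names are hypothetical; real MLLM layers are transformer blocks, stubbed here as plain callables):

```python
import numpy as np

def forward_with_vtw(hidden, layers, vision_slice, withdraw_layer):
    """Run a stack of layers, withdrawing vision tokens at withdraw_layer.

    hidden: (seq_len, dim) array of token hidden states
    layers: list of callables, each mapping (n, dim) -> (n, dim)
    vision_slice: slice covering the vision-token positions
    withdraw_layer: layer index K at which vision tokens are removed
    """
    for i, layer in enumerate(layers):
        if i == withdraw_layer:
            # Withdraw vision tokens: only text tokens remain from here on,
            # shrinking attention/FFN cost in every deeper layer.
            keep = np.ones(hidden.shape[0], dtype=bool)
            keep[vision_slice] = False
            hidden = hidden[keep]
        hidden = layer(hidden)
    return hidden

# Toy usage: 1 text token, 6 vision tokens, 3 text tokens; trivial "layers".
rng = np.random.default_rng(0)
h = rng.standard_normal((10, 8))
layers = [lambda x: x + 1.0 for _ in range(6)]
out = forward_with_vtw(h, layers, vision_slice=slice(1, 7), withdraw_layer=3)
print(out.shape)  # (4, 8): vision tokens were dropped before the deep layers
```

Because the deep layers simply see a shorter sequence, this is compatible with standard KV-cache and FlashAttention implementations, unlike per-token pruning schemes that need custom attention masks.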
To determine the withdrawal layer, the authors run the model on a small calibration set and select the earliest layer at which withdrawing vision tokens leaves the output distribution essentially unchanged, measured by a Kullback-Leibler divergence criterion. VTW reduces computational overhead by over 40% across various multimodal tasks while maintaining performance, and it is compatible with the KV cache and FlashAttention, making it suitable for real-time applications.
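The layer-selection criterion can be sketched as follows: for each candidate layer K, compare the model's output distribution with and without vision tokens withdrawn at K, and take the first K whose KL divergence falls below a threshold. This is a simplified illustration, not the authors' implementation; the function names, the threshold value, and the deterministic toy logits are all invented for the example:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_div(p, q, eps=1e-12):
    # KL(p || q), summed over the vocabulary axis, averaged over positions.
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

def select_withdrawal_layer(logits_full, logits_withdrawn, threshold=1e-2):
    """Return the first layer index K at which withdrawing vision tokens
    barely changes the output distribution.

    logits_full: per-candidate-layer logits from the unmodified model
    logits_withdrawn: matching logits when vision tokens are withdrawn at K
    """
    for k, (lf, lw) in enumerate(zip(logits_full, logits_withdrawn)):
        if kl_div(softmax(lf), softmax(lw)) < threshold:
            return k
    return len(logits_full)  # no layer is safe: never withdraw

# Deterministic toy data: withdrawing at layers 0-2 shifts the output
# distribution a lot; from layer 3 on it matches the full model exactly.
full = [np.zeros(32) for _ in range(5)]
withdrawn = []
for k in range(5):
    v = np.zeros(32)
    if k < 3:
        v[0] = 5.0  # early withdrawal distorts the prediction
    withdrawn.append(v)

k_star = select_withdrawal_layer(full, withdrawn)
print(k_star)  # 3
```

In the paper's setting the two logit sets would come from running the MLLM on a handful of calibration examples, which is why only a tiny amount of data is needed to pick the layer.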
The authors conduct extensive experiments on various multimodal tasks, including visual question answering, hallucination evaluation, visual reasoning, and video understanding, demonstrating that VTW significantly reduces FLOPs without compromising performance. The method also applies to multimodal chatbots, where it accelerates inference while producing answers with imperceptible differences from the unmodified model.
The paper also discusses the limitations of existing methods for reducing computational costs in MLLMs: lack of flexibility, incomplete importance metrics, and incompatibility with the KV cache and FlashAttention. The authors argue that VTW provides a more comprehensive solution by removing vision tokens at a specific layer, ensuring flexibility across tasks and compatibility with existing inference mechanisms.
The results show that VTW outperforms existing methods in terms of computational efficiency and maintains high performance across different tasks. The method is applicable to various MLLMs, including LLaVA, LLaVA-NeXT, and Video-LLaVA, and has been tested on a wide range of downstream tasks, including segmentation and reasoning. The paper concludes that VTW is an effective method for accelerating MLLMs while maintaining performance, making it a valuable tool for real-time applications.