LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models


May 22, 2024 | Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan
LLaVA-PruMerge is an adaptive token-reduction method that improves the efficiency of Large Multimodal Models (LMMs) by substantially cutting the number of visual tokens passed to the language model without compromising performance. It exploits the sparsity of the attention between the class token and the visual tokens in the visual encoder to identify and retain only the most important visual tokens: selection uses outlier detection on these attention scores, and the pruned tokens are then folded back in through similarity clustering, so the merged tokens preserve the image's informational content at a fraction of the compute.

Empirically, LLaVA-PruMerge compresses visual tokens by about 14 times on average while achieving comparable performance across a range of visual question-answering and reasoning tasks. The method also transfers to video models such as Video-LLaVA, where it reduces the number of visual tokens while improving performance. PruMerge+ refines the selection with spatially uniform sampling, ensuring the retained tokens cover the image more evenly and representatively. Overall, the approach delivers significant savings in computation and memory while preserving the reasoning capabilities of LMMs.
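The select-then-merge idea above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name, the IQR outlier rule's parameterization, and the attention-weighted merge are assumptions made for clarity.

```python
import numpy as np

def prumerge_sketch(tokens, attn_scores, iqr_k=1.5):
    """Illustrative PruMerge-style token reduction (hypothetical sketch).

    tokens:      (N, D) visual token features from the encoder
    attn_scores: (N,)   class-token -> visual-token attention scores
    Tokens whose attention is an upper outlier (IQR rule) are kept;
    each pruned token is merged into its most similar kept token
    via an attention-weighted running average.
    """
    # Outlier detection on the class-token attention distribution
    q1, q3 = np.percentile(attn_scores, [25, 75])
    threshold = q3 + iqr_k * (q3 - q1)
    keep = np.where(attn_scores > threshold)[0]
    if keep.size == 0:  # degenerate case: keep at least the top token
        keep = np.array([attn_scores.argmax()])
    drop = np.setdiff1d(np.arange(len(tokens)), keep)

    # Cosine similarity between pruned tokens and kept tokens
    unit = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
    sim = unit(tokens[drop]) @ unit(tokens[keep]).T  # (|drop|, |keep|)
    nearest = sim.argmax(axis=1)                     # cluster assignment

    # Merge each pruned token into its nearest kept token
    merged = tokens[keep].copy()
    weight = attn_scores[keep].copy()
    for d, k in zip(drop, nearest):
        w = attn_scores[d]
        merged[k] = (weight[k] * merged[k] + w * tokens[d]) / (weight[k] + w)
        weight[k] += w
    return merged, keep
```

The key property is that the output has only `len(keep)` tokens (a small fraction of the input), yet every input token contributed to some surviving token rather than being discarded outright.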
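PruMerge+'s spatially uniform sampling can likewise be sketched as taking the union of the attention outliers with a regular grid over the token map, so that kept tokens never cluster in one image region. Again, the function name, grid stride, and IQR parameters are illustrative assumptions, not the published details.

```python
import numpy as np

def prumerge_plus_keep(attn_scores, h, w, stride=4, iqr_k=1.5):
    """Illustrative PruMerge+-style selection (hypothetical sketch).

    attn_scores: (h * w,) class-token attention over an h x w token map.
    Returns the sorted indices to keep: the union of IQR attention
    outliers and a stride-spaced, spatially uniform grid of positions.
    """
    # Adaptive part: attention outliers, as in the base method
    q1, q3 = np.percentile(attn_scores, [25, 75])
    outliers = np.where(attn_scores > q3 + iqr_k * (q3 - q1))[0]

    # Uniform part: a regular grid covering the whole token map
    rows = np.arange(0, h, stride)
    cols = np.arange(0, w, stride)
    grid = (rows[:, None] * w + cols[None, :]).ravel()

    return np.union1d(outliers, grid)  # sorted, deduplicated
```

The grid guarantees a floor of evenly spread tokens even when the attention outliers all fall in one region, which is what makes the selection "more comprehensive and representative."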