22 May 2024 | Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan
**Abstract:**
Large Multimodal Models (LMMs) have demonstrated significant visual reasoning capabilities by integrating a visual encoder and a large language model. However, because self-attention in the Transformer architecture scales quadratically with sequence length, their computational cost grows rapidly with the number of input tokens. To address this, we propose PruMerge, an adaptive visual token reduction strategy that significantly reduces the number of visual tokens without compromising performance. PruMerge leverages the sparsity of the attention scores between the visual encoder's class token and its visual tokens to dynamically select the crucial visual tokens. The pruned tokens are then clustered around the selected ones and merged into them, so the kept tokens retain the discarded information. Empirically, PruMerge compresses the visual tokens of LLaVA-1.5 by a factor of 14 on average while maintaining comparable performance across diverse visual question-answering and reasoning tasks.
**Introduction:**
LMMs, which combine large language models (LLMs) with visual encoders, have shown strong visual reasoning capabilities. Their computational cost is high, however, because a large number of visual tokens is prepended to the language model's input as prefix content. Previous work has focused on shrinking the LLM backbone, which sacrifices reasoning ability. Our approach, PruMerge, instead reduces the number of visual tokens while maintaining performance: it applies outlier detection to the visual encoder's attention scores to select the most informative visual tokens, then merges the pruned tokens into the selected ones so their information is not lost. Empirically, PruMerge reduces the visual tokens of LLaVA-1.5 by a factor of 14 on average while maintaining comparable performance.
**Related Work:**
Efforts to improve LMM efficiency have focused on reducing the size of the LLM backbone or applying quantization techniques. Our work instead targets the number of visual tokens, which is a key bottleneck in LMMs.
**Method: Token Pru-Merging:**
PruMerge consists of two main components: Adaptive Important Token Selection (AITS) and Token Supplement (TS). AITS treats the attention scores between the visual encoder's class token and its visual tokens as an importance signal and applies the Interquartile Range (IQR) outlier rule to adaptively select the important tokens, so the number of kept tokens varies with image complexity. TS then supplements each selected token by merging in its most similar pruned tokens via weighted averaging, enhancing the informational content of the kept tokens. A minimal sketch of both stages follows.
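The PyTorch sketch below illustrates the two stages under stated assumptions; it is not the authors' reference implementation. The function names `select_important_tokens` and `merge_tokens`, the `attn_cls` input (class-token-to-visual-token attention scores taken from the encoder), the cosine-similarity neighbor search, and the neighbor count `k=8` are all illustrative choices.

```python
# Minimal sketch of PruMerge's two stages for a generic ViT-style encoder.
# Shapes, helper names, and the neighbor search are assumptions, not the
# paper's reference code.
import torch

def select_important_tokens(attn_cls: torch.Tensor) -> torch.Tensor:
    """AITS: pick visual tokens whose class-token attention score is an
    upper outlier under the IQR rule (score > Q3 + 1.5 * IQR).

    attn_cls: (n,) attention scores from the class token to each of the
    n visual tokens. Returns the indices of the kept tokens.
    """
    q1, q3 = torch.quantile(attn_cls, torch.tensor([0.25, 0.75]))
    iqr = q3 - q1
    return torch.nonzero(attn_cls > q3 + 1.5 * iqr).squeeze(-1)

def merge_tokens(tokens: torch.Tensor, attn_cls: torch.Tensor,
                 keep_idx: torch.Tensor, k: int = 8) -> torch.Tensor:
    """TS: enrich each kept token with its k most similar pruned tokens,
    weighting every token by its class-attention score.

    tokens: (n, d) visual token embeddings; keep_idx: indices from AITS.
    Returns (m, d) merged tokens, where m = len(keep_idx).
    """
    mask = torch.ones(tokens.size(0), dtype=torch.bool)
    mask[keep_idx] = False
    pruned, pruned_w = tokens[mask], attn_cls[mask]

    merged = []
    for i in keep_idx:
        anchor = tokens[i]
        # Cosine similarity between the kept token and all pruned tokens.
        sim = torch.nn.functional.cosine_similarity(anchor[None], pruned, dim=-1)
        nn_idx = sim.topk(min(k, pruned.size(0))).indices
        # Weighted average over the anchor and its nearest pruned neighbors,
        # each weighted by its class-attention score.
        w = torch.cat([attn_cls[i][None], pruned_w[nn_idx]])
        group = torch.cat([anchor[None], pruned[nn_idx]])
        merged.append((w[:, None] * group).sum(0) / w.sum())
    return torch.stack(merged)
```

Note the design consequence of the IQR rule: the token budget is adaptive rather than fixed. Visually simple images produce few attention outliers and thus few kept tokens, while complex images keep more.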
**Experiments:**
PruMerge is evaluated on diverse visual question-answering and reasoning benchmarks, showing significant efficiency gains. It reduces the number of visual tokens by a factor of 14 on average while matching, and on some benchmarks exceeding, the performance of the uncompressed model. PruMerge also generalizes to video-LLMs without additional training.
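As a back-of-the-envelope check of what the 14× ratio implies (our own arithmetic; only the compression factor comes from the summary above, and the 576-token prefix is the standard output of LLaVA-1.5's CLIP ViT-L/14 encoder at 336px, i.e. 24 × 24 patches):

```python
# Rough cost estimate for LLaVA-1.5 with PruMerge. The 14x ratio is from
# the paper; 576 is LLaVA-1.5's visual prefix length (24 x 24 patches);
# the rest is simple arithmetic, not reported numbers.
visual_tokens = 576
compression = 14
kept = round(visual_tokens / compression)
print(f"visual tokens after PruMerge: ~{kept}")                       # ~41
# Self-attention cost over the visual prefix scales quadratically with
# its length, so that portion shrinks by roughly compression**2.
print(f"attention cost on visual prefix: ~{compression**2}x lower")  # ~196x
# Per-token costs (e.g., FFN layers) scale linearly with prefix length.
print(f"linear per-token cost: ~{compression}x lower")               # ~14x
```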
**Conclusion:**
PruMerge effectively reduces the number of visual tokens in LMMs, improving efficiency without sacrificing performance. This approach has broad implications for the deployment and accessibility of LMMs, particularly in edge computing environments.