31 May 2024 | Linli Yao†, Lei Li‡, Shuhuai Ren†, Lean Wang†, Yuanxin Liu†, Xu Sun†, Lu Hou§
The paper "DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models" by Linli Yao et al. examines the effectiveness of visual projectors in Multimodal Large Language Models (MLLMs). The authors propose a novel analysis tool, R-GAE, to trace the semantic flow from generated language tokens back to raw visual encoder patches and the projector's intermediate outputs. They find that compressive projectors, such as QFormer, abstract visual patches into a limited set of semantic concepts, leading to a "double abstraction" phenomenon: a first visual semantic abstraction by the projector and a second extraction by the LLM, resulting in inefficiency and a cumulative deficiency in vision semantics.
To mitigate this issue, the authors introduce the "Decouple Compression from Abstraction (DeCo)" approach. DeCo compresses visual tokens at the patch level using a simple compressor, such as 2D adaptive average pooling, leaving visual semantic abstraction entirely to the LLM. Empirical evaluations show that DeCo outperforms traditional compressive projectors in both performance and efficiency, achieving gains of 0.9%, 7.1%, and 2.9% across various benchmarks. DeCo also preserves the spatial locality of vision tokens and remains robust across different MLLM configurations, including different vision backbones, image resolutions, and LLMs. The code for DeCo is available at https://github.com/yaolinli/DeCo.
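The patch-level compression described above can be sketched in plain Python. This is an illustrative, dependency-free approximation (the paper's implementation presumably uses a framework primitive such as PyTorch's `nn.AdaptiveAvgPool2d`; the function name and list-based representation here are assumptions for the sketch): patch tokens arriving as a flat row-major sequence are viewed as an H×W grid and averaged within adaptive spatial bins, so the output token count shrinks while spatial locality is preserved.

```python
import math

def adaptive_avg_pool_2d(tokens, grid_h, grid_w, out_h, out_w):
    """Compress a grid_h x grid_w grid of patch tokens (each a feature
    vector) down to out_h x out_w tokens by averaging over adaptive
    spatial bins, mimicking 2D adaptive average pooling.

    tokens: flat row-major list of length grid_h * grid_w, where each
    element is a feature vector (list of floats). Returns a flat
    row-major list of out_h * out_w pooled feature vectors.
    """
    dim = len(tokens[0])
    pooled = []
    for i in range(out_h):
        # Row bin i covers input rows [r0, r1).
        r0 = (i * grid_h) // out_h
        r1 = math.ceil((i + 1) * grid_h / out_h)
        for j in range(out_w):
            # Column bin j covers input columns [c0, c1).
            c0 = (j * grid_w) // out_w
            c1 = math.ceil((j + 1) * grid_w / out_w)
            count = (r1 - r0) * (c1 - c0)
            vec = [0.0] * dim
            for r in range(r0, r1):
                for c in range(c0, c1):
                    patch = tokens[r * grid_w + c]
                    for d in range(dim):
                        vec[d] += patch[d]
            pooled.append([v / count for v in vec])
    return pooled
```

For example, the 576 patch tokens of a 24×24 ViT grid would be reduced to 144 tokens via `adaptive_avg_pool_2d(tokens, 24, 24, 12, 12)`, a 4× compression with no learned parameters; the pooled tokens keep their row-major spatial arrangement for the LLM to abstract.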