31 May 2024 | Linli Yao, Lei Li, Shuhuai Ren, Lean Wang, Yuanxin Liu, Xu Sun, Lu Hou
DeCo is a method that decouples visual token compression from semantic abstraction in Multimodal Large Language Models (MLLMs). The study identifies a "Double Abstraction" phenomenon in existing projectors: visual semantics are first abstracted by the projector and then re-extracted by the LLM, causing both inefficiency and a deficiency of visual semantics. To address this, DeCo replaces learned compressive projectors with a simple compressor, 2D Adaptive Pooling, which downsamples visual patches in a parameter-free manner and leaves visual semantic abstraction entirely to the LLM. This reduces the number of visual tokens while preserving spatial locality and visual context. Empirically, DeCo outperforms traditional compressive projectors in both performance and efficiency, with gains of 0.9% on MLLM benchmarks, 7.1% on visual localization, and 2.9% on open-ended VQA, while using fewer trainable parameters and converging faster. It is also robust across MLLM configurations, including different vision backbones, image resolutions, and LLMs. Implemented with a simple AdaptiveAvgPool, DeCo compresses visual tokens at the patch level before passing them to the LLM, making it an effective, efficient, and robust solution for improving MLLMs.
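To make the compression step concrete, here is a minimal numpy sketch of parameter-free 2D adaptive average pooling over a grid of visual patch tokens. It mirrors the window arithmetic of PyTorch's `AdaptiveAvgPool2d` (which the paper says DeCo uses); the function name, token counts, and embedding size below are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def adaptive_avg_pool_2d(tokens, out_h, out_w):
    """Downsample a [H*W, D] grid of visual patch tokens to [out_h*out_w, D]
    by averaging adaptive 2D windows (parameter-free, like AdaptiveAvgPool2d)."""
    hw, d = tokens.shape
    h = w = int(hw ** 0.5)  # assume a square patch grid, e.g. 24x24 from a ViT
    grid = tokens.reshape(h, w, d)
    out = np.empty((out_h, out_w, d))
    for i in range(out_h):
        # window bounds: floor(i*h/out_h) .. ceil((i+1)*h/out_h), as in PyTorch
        r0, r1 = (i * h) // out_h, -((-(i + 1) * h) // out_h)
        for j in range(out_w):
            c0, c1 = (j * w) // out_w, -((-(j + 1) * w) // out_w)
            # each output token is the mean of its spatial window,
            # so spatial locality is preserved without any learned parameters
            out[i, j] = grid[r0:r1, c0:c1].mean(axis=(0, 1))
    return out.reshape(out_h * out_w, d)

# e.g. 576 patch tokens (24x24 grid) -> 144 tokens (12x12), a 4x compression
tokens = np.random.randn(576, 1024)
pooled = adaptive_avg_pool_2d(tokens, 12, 12)
print(pooled.shape)  # (144, 1024)
```

Because the pooling is parameter-free, the only trainable projector component left is whatever maps token dimensions to the LLM's embedding space, which is consistent with the abstract's claim of fewer trainable parameters.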