Dense Connector for MLLMs

22 May 2024 | Huanjin Yao1,3*, Wenhao Wu2*, Taojiannan Yang4, Yuxin Song3, Mengxi Zhang3 Haocheng Feng3, Yifan Sun3, Zhiheng Li1, Wanli Ouyang5, Jingdong Wang3
The paper introduces the Dense Connector (DC), a novel and effective module designed to enhance the visual perception capabilities of Multimodal Large Language Models (MLLMs). The DC leverages multi-layer visual features from a pre-trained vision encoder, such as CLIP, to provide more comprehensive visual cues to the LLM. The approach is simple, plug-and-play, and adds minimal computational overhead. The paper explores three instantiations of the DC: Sparse Token Integration (STI), Sparse Channel Integration (SCI), and Dense Channel Integration (DCI). These methods integrate visual features from different encoder layers to enrich the visual input to the LLM, improving its performance across various benchmarks.
The DC is evaluated across different visual encoders, image resolutions, training dataset scales, and LLM sizes, demonstrating its versatility and scalability. The paper also extends the DC to video understanding, achieving state-of-the-art performance on multiple video benchmarks without video-specific tuning. The results highlight the significant benefits of leveraging multi-layer visual features, providing valuable insights for future MLLM development.