This paper introduces the Dense Connector, a simple, effective, plug-and-play vision-language connector that enhances the visual representation of Multimodal Large Language Models (MLLMs) with minimal additional computational overhead. The Dense Connector leverages multi-layer visual features from a pre-trained visual encoder to provide richer visual cues to the LLM. The paper proposes three instantiations: Sparse Token Integration (STI), Sparse Channel Integration (SCI), and Dense Channel Integration (DCI).

These methods yield consistent performance gains across a range of vision encoders, image resolutions, training dataset scales, and LLM sizes. Designed to be easily integrated into existing MLLMs, the Dense Connector is compatible with different architectures, including LLaVA and Mini-Gemini, and achieves state-of-the-art performance on 19 image and video benchmarks. Although trained solely on images, the model exhibits remarkable zero-shot capabilities in video understanding, extending to video tasks without additional training. The paper further examines the scalability and compatibility of the Dense Connector across visual encoders and LLM sizes, demonstrating its versatility in enhancing MLLMs.

The results show that the Dense Connector significantly improves the visual perception capabilities of MLLMs, leading to more accurate responses. The paper also discusses its limitations and calls for further research into more efficient ways of connecting visual and language models.
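The three instantiations can be sketched roughly as follows. This is a minimal NumPy illustration of the feature-combination patterns only, assuming a visual encoder that exposes per-layer features of shape (num_tokens, dim); the layer indices, stride, and group count are illustrative placeholders, not values from the paper.

```python
import numpy as np

def sparse_token_integration(features, layer_ids, stride=2):
    """STI: concatenate tokens from selected layers along the token axis,
    downsampling earlier layers to keep the sequence length manageable."""
    parts = [features[i][::stride] for i in layer_ids[:-1]]
    parts.append(features[layer_ids[-1]])  # keep the final layer intact
    return np.concatenate(parts, axis=0)

def sparse_channel_integration(features, layer_ids):
    """SCI: concatenate selected layers along the channel axis,
    leaving the token count unchanged."""
    return np.concatenate([features[i] for i in layer_ids], axis=1)

def dense_channel_integration(features, num_groups=2):
    """DCI: summarize contiguous groups of layers (here by averaging),
    then concatenate the group summaries channel-wise."""
    groups = np.array_split(np.stack(features), num_groups)
    summaries = [g.mean(axis=0) for g in groups]
    return np.concatenate(summaries, axis=1)

# Toy example: 24 encoder layers, each with 16 tokens of dimension 8.
features = [np.random.randn(16, 8) for _ in range(24)]
sti = sparse_token_integration(features, layer_ids=[7, 15, 23])
sci = sparse_channel_integration(features, layer_ids=[7, 15, 23])
dci = dense_channel_integration(features, num_groups=2)
```

In all three cases the combined features would then pass through the usual projector into the LLM's embedding space; the channel-wise variants (SCI, DCI) avoid increasing the visual token count, which is why they add little computational overhead.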