ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models

24 May 2024 | Chunjiang Ge1, Sijie Cheng3, Ziming Wang2, Jiale Yuan2, Yuan Gao2, Jun Song2, Shiji Song1, Gao Huang1, Bo Zheng2
ConvLLaVA is a novel approach designed to address the challenges of high-resolution Large Multimodal Models (LMMs) by employing a hierarchical backbone, ConvNeXt, as the visual encoder. The primary issue with current high-resolution LMMs is the quadratic visual complexity and the generation of excessive visual tokens, which leads to significant computational overhead. ConvLLaVA mitigates this by compressing high-resolution images into information-rich visual features, effectively reducing the number of visual tokens.

To enhance the capabilities of ConvLLaVA, two key optimizations are proposed:

1. **Updating the visual encoder**: ConvNeXt, which is pre-trained on low-resolution data, is updated to bridge the gap between low and high resolutions. This update improves ConvNeXt's performance on both general and fine-grained benchmarks.
2. **Training an additional stage**: An additional ConvNeXt stage is trained to further compress the visual tokens, reducing redundancy and improving efficiency (see the sketches below).

These optimizations enable ConvLLaVA to support inputs of 1536×1536 resolution while generating only 576 visual tokens, and to handle images of arbitrary aspect ratios. Experimental results show that ConvLLaVA achieves competitive performance with state-of-the-art models on various benchmarks, including MME, MMBench, SEEDBench, RealWorldQA, TextVQA, DocVQA, POPE, and MMVet. The paper also discusses the importance of linear spatial complexity and information compression for future visual encoders in LMMs, highlighting the trade-off between compression and retrieval capabilities for high-resolution understanding. Overall, ConvLLaVA offers a simple yet effective way to scale up resolution while maintaining computational efficiency.
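To make the token arithmetic concrete, the sketch below compares how many visual tokens a plain patch-based ViT and a hierarchical 64×-downsampling encoder produce at a given resolution. It is not taken from the paper's code; the patch size and downsampling factors are assumptions based on a standard ViT-L/14 and a standard 4-stage ConvNeXt (4× stem plus three 2× stages → 32× overall) extended with the extra 2× stage described above (→ 64× overall).

```python
# Minimal sketch (assumed settings, not the paper's code): visual token
# counts for a patch-based ViT versus a hierarchical encoder.

def vit_tokens(resolution: int, patch_size: int = 14) -> int:
    """One token per patch for a plain ViT."""
    side = resolution // patch_size
    return side * side

def hierarchical_tokens(resolution: int, total_downsampling: int = 64) -> int:
    """Tokens after hierarchical downsampling: a 5-stage ConvNeXt
    (4x stem + four 2x stages) reduces the spatial side by 64."""
    side = resolution // total_downsampling
    return side * side

if __name__ == "__main__":
    print(vit_tokens(336, 14))            # 576  (ViT-L/14 at 336 px, LLaVA-1.5 setting)
    print(vit_tokens(1536, 14))           # 11881 (~20x more tokens at 1536 px)
    print(hierarchical_tokens(1536, 64))  # 576  (1536 / 64 = 24, 24 x 24 tokens)
```

This is why the same 576-token budget that a 336-px ViT uses can, with 64× hierarchical compression, cover a 1536×1536 input.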
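The "additional stage" idea can likewise be sketched as one extra stride-2 block on top of the backbone's last feature map, followed by a LLaVA-style MLP projector. Everything below is a hypothetical illustration of the compression step, not the paper's implementation: the class and parameter names (`ExtraStageCompressor`, `hidden`, `llm_dim`), channel widths, and layer choices are assumptions.

```python
import torch
import torch.nn as nn

class ExtraStageCompressor(nn.Module):
    """Hypothetical sketch: take the feature map of a 4-stage hierarchical
    backbone (32x downsampled, NCHW) and apply one more stride-2 stage
    before projecting features to the LLM embedding space."""

    def __init__(self, in_channels: int = 1536, hidden: int = 3072, llm_dim: int = 4096):
        super().__init__()
        # Stride-2 convolution halves each spatial side, quartering the
        # number of visual tokens (e.g. 48x48 -> 24x24 for a 1536-px input).
        self.downsample = nn.Conv2d(in_channels, hidden, kernel_size=2, stride=2)
        # Small depthwise block to refine the compressed features.
        self.refine = nn.Sequential(
            nn.GELU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden),
            nn.GELU(),
        )
        # Two-layer MLP projector in the spirit of LLaVA-style connectors.
        self.projector = nn.Sequential(
            nn.Linear(hidden, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) feature map from the backbone's last stage.
        x = self.downsample(feats)        # (B, hidden, H/2, W/2)
        x = self.refine(x)
        x = x.flatten(2).transpose(1, 2)  # (B, H/2 * W/2, hidden) visual tokens
        return self.projector(x)          # (B, num_tokens, llm_dim)

if __name__ == "__main__":
    # A 32x-downsampling backbone turns a 1536x1536 image into a 48x48 map;
    # the extra stage compresses it to 24x24 = 576 visual tokens.
    dummy = torch.randn(1, 1536, 48, 48)
    tokens = ExtraStageCompressor()(dummy)
    print(tokens.shape)  # torch.Size([1, 576, 4096])
```

The stride-2 stage is the piece that trades some fine detail for a 4× reduction in token count, which is the compression-versus-retrieval trade-off the paper highlights for high-resolution understanding.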