24 May 2024 | Chunjiang Ge, Sijie Cheng, Ziming Wang, Jiale Yuan, Yuan Gao, Jun Song, Shiji Song, Gao Huang, Bo Zheng
ConvLLaVA is a hierarchical-backbone visual encoder for large multimodal models (LMMs) that addresses two challenges of high-resolution input: excessive visual tokens and the quadratic complexity of attention over them. The model adopts ConvNeXt as its visual encoder, which is more efficient than the Vision Transformer (ViT) thanks to its linear spatial complexity and higher compression ratio: ConvNeXt compresses high-resolution images into information-rich visual features, reducing both the number of visual tokens and the computational load on the language model. ConvLLaVA further introduces an additional stage that compresses the visual tokens to an overall 64× ratio, so a 1536×1536 input yields only 576 visual tokens (1536/64 = 24 per side, and 24 × 24 = 576). The design also handles images of arbitrary aspect ratio efficiently and supports training on low-resolution inputs while evaluating on high-resolution images.

Extensive experiments show that ConvLLaVA outperforms state-of-the-art models on a range of benchmarks, including MME, MMBench, SEEDBench, TextVQA, DocVQA, POPE, and MMVet, with gains on both general and fine-grained tasks. Its hierarchical structure and efficient compression strategy significantly reduce redundancy and computational cost, making it a promising visual encoder for future high-resolution LMMs.
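To make the compression arithmetic concrete, here is a minimal PyTorch sketch of a toy hierarchical convolutional encoder. The block design, stage widths, and channel counts are illustrative assumptions, not the paper's actual ConvNeXt configuration; the only numbers taken from the summary above are the 64× spatial reduction and the resulting 576 tokens at 1536×1536.

```python
import torch
import torch.nn as nn

class ConvStage(nn.Module):
    """One downsampling stage: a stride-2 conv followed by a depthwise block,
    loosely in the spirit of ConvNeXt (illustrative, not the real block)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.down = nn.Conv2d(in_ch, out_ch, kernel_size=2, stride=2)  # halves H and W
        self.mix = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, kernel_size=7, padding=3, groups=out_ch),
            nn.GELU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=1),
        )

    def forward(self, x):
        x = self.down(x)
        return x + self.mix(x)  # residual keeps the stage shape-preserving after downsampling

class ToyHierarchicalEncoder(nn.Module):
    """Stem (4x) + four 2x stages = 64x total spatial reduction.
    A stock 4-stage ConvNeXt stops at 32x; the fifth stage here mirrors
    ConvLLaVA's extra compression stage (widths are made up)."""
    def __init__(self, widths=(96, 192, 384, 768, 1536)):
        super().__init__()
        self.stem = nn.Conv2d(3, widths[0], kernel_size=4, stride=4)        # 4x
        self.stages = nn.ModuleList(
            ConvStage(widths[i], widths[i + 1]) for i in range(4)           # 2x each
        )

    def forward(self, x):
        x = self.stem(x)
        for stage in self.stages:
            x = stage(x)
        # Flatten the final feature map into a token sequence for the LLM.
        return x.flatten(2).transpose(1, 2)  # (B, H*W, C)

img = torch.randn(1, 3, 1536, 1536)
tokens = ToyHierarchicalEncoder()(img)
print(tokens.shape)  # torch.Size([1, 576, 1536]): 1536/64 = 24 per side, 24*24 = 576 tokens
```

The per-side reduction of 64 comes from the 4× stem multiplied by four 2× stages. Without the fifth stage, a standard 32× backbone would leave 48 × 48 = 2,304 tokens at the same resolution, which illustrates why the extra compression stage matters for keeping the language model's sequence length manageable.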