Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models

14 Jun 2024 | Yi-Fan Zhang, Qingsong Wen, Chaoyou Fu, Xue Wang, Zhang Zhang, Liang Wang, Rong Jin
This paper introduces SliME, a novel framework for large multimodal models (LMMs) that enhances image understanding by prioritizing global context while incorporating local image details. The authors address the challenges of high-resolution image processing, where traditional methods often suffer from high computational costs and loss of global context. SliME employs a mixture of adapters to refine the global context and a query transformer to compress local features, along with a text-guided router that selects the local image tokens most relevant to the query. An alternating training strategy balances global and local learning, ensuring effective optimization.
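The core idea of the text-guided router can be sketched as scoring each local image token against a text embedding and keeping only the highest-scoring tokens. This is a minimal illustrative sketch, not the paper's exact formulation; the function name, cosine-similarity scoring, and top-k selection are assumptions made for clarity.

```python
import numpy as np

def text_guided_router(local_tokens, text_embedding, top_k):
    """Select the top_k local image tokens most relevant to the text query.

    Illustrative sketch of a text-guided router: score each local token
    by its cosine similarity to the text embedding, then keep only the
    highest-scoring tokens (the scoring rule here is an assumption).
    """
    # Normalize tokens and text embedding so the dot product is cosine similarity.
    tok_norm = local_tokens / np.linalg.norm(local_tokens, axis=1, keepdims=True)
    txt_norm = text_embedding / np.linalg.norm(text_embedding)
    scores = tok_norm @ txt_norm                 # one relevance score per token
    keep = np.argsort(scores)[::-1][:top_k]      # indices of the best tokens
    return local_tokens[keep], np.sort(keep)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))   # 16 local image tokens of dimension 8
query = rng.normal(size=8)          # pooled text embedding (same dimension)
selected, idx = text_guided_router(tokens, query, top_k=4)
print(selected.shape)               # (4, 8): only 4 of 16 tokens are kept
```

Routing like this reduces the number of local tokens passed to the language model, which is one way to keep the computational cost of high-resolution inputs manageable.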
The framework is evaluated on SMR, a challenging benchmark that includes tasks requiring detailed visual reasoning. SliME achieves state-of-the-art performance across multiple benchmarks despite being trained on only 2 million samples. It performs particularly well on tasks requiring fine-grained visual understanding and reasoning, highlighting its effectiveness on high-resolution images. The paper also discusses why alternating training helps when optimizing bilinear functions, and the benefits of training on a diverse dataset. Overall, SliME offers a promising approach to improving LMMs for high-resolution image processing.
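The remark about alternating training and bilinear functions can be illustrated with a classic toy problem (an assumption for illustration, not SliME's actual loss): rank-1 matrix factorization, where the objective is bilinear in the two factors, and fixing one factor makes the subproblem for the other a simple least-squares update.

```python
import numpy as np

# Toy bilinear objective: minimize ||M - u v^T||_F^2 over u and v.
# With v fixed the problem is linear in u (and vice versa), so we
# alternate closed-form least-squares updates -- the same structural
# reason alternating training is well suited to bilinear objectives.
rng = np.random.default_rng(1)
M = np.outer(rng.normal(size=6), rng.normal(size=5))  # exactly rank-1 target

u = rng.normal(size=6)
v = rng.normal(size=5)
for _ in range(20):
    u = M @ v / (v @ v)      # optimal u with v held fixed
    v = M.T @ u / (u @ u)    # optimal v with u held fixed

residual = np.linalg.norm(M - np.outer(u, v))
print(residual)              # near zero on a rank-1 target
```

Jointly optimizing both factors is a non-convex problem, but each alternating subproblem is convex with a closed-form solution, which is why alternating updates converge reliably here.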