Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models

14 Jun 2024 | Yi-Fan Zhang, Qingsong Wen, Chaoyou Fu, Xue Wang, Zhang Zhang, Liang Wang, Rong Jin
The paper "Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models" addresses the challenges of high-resolution image processing in Large Multimodal Models (LMMs). The authors propose a new framework, SLiME (LLM with Sophisticated Tasks, Local image compression, and Mixture of global Experts), which aims to enhance the global context while compressing local image tokens to improve computational efficiency. The key contributions include: 1. **Global Context Refinement**: Using a mixture of adapters to refine global context, leveraging the strengths of both MLP and query former adapters. 2. **Local Feature Compression**: Employing a query transformer to compress local features, reducing computational costs. 3. **Alternating Training**: Introducing an alternating training scheme to optimize the bilinear optimization problem, ensuring balanced learning between global and local aspects. 4. **Challenging Dataset**: Creating the Science and Mathematical Reasoning (SMR) dataset, which includes complex reasoning tasks and rich image annotations, enhancing the training of the local compression layer. The empirical results demonstrate that SLiME achieves leading performance across various benchmarks with only 2 million training data, outperforming other models such as Gemini Pro and Qwen-VL-Plus. The paper also discusses the limitations and future work, including the need for further optimization methods and image token reduction techniques.The paper "Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models" addresses the challenges of high-resolution image processing in Large Multimodal Models (LMMs). The authors propose a new framework, SLiME (LLM with Sophisticated Tasks, Local image compression, and Mixture of global Experts), which aims to enhance the global context while compressing local image tokens to improve computational efficiency. The key contributions include: 1. **Global Context Refinement**: Using a mixture of adapters to refine global context, leveraging the strengths of both MLP and query former adapters. 2. **Local Feature Compression**: Employing a query transformer to compress local features, reducing computational costs. 3. **Alternating Training**: Introducing an alternating training scheme to optimize the bilinear optimization problem, ensuring balanced learning between global and local aspects. 4. **Challenging Dataset**: Creating the Science and Mathematical Reasoning (SMR) dataset, which includes complex reasoning tasks and rich image annotations, enhancing the training of the local compression layer. The empirical results demonstrate that SLiME achieves leading performance across various benchmarks with only 2 million training data, outperforming other models such as Gemini Pro and Qwen-VL-Plus. The paper also discusses the limitations and future work, including the need for further optimization methods and image token reduction techniques.