LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images


18 Mar 2024 | Ruyi Xu, Yuan Yao, Zonghao Guo, Junbo Cui, Zanlin Ni, Chunjiang Ge, Tat-Seng Chua, Zhiyuan Liu, Maosong Sun, Gao Huang
This paper addresses the limitations of existing large multimodal models (LMMs) in processing images with varying aspect ratios and high resolutions. Conventional LMMs, such as GPT-4V and LLaVA-1.5, struggle with such inputs because they encode images at a fixed size and low resolution, leading to shape distortion and blurred content. To overcome these issues, the authors propose LLaVA-UHD, a large multimodal model designed to efficiently perceive images of any aspect ratio and high resolution.

**Key Components of LLaVA-UHD:** (minimal sketches of these components follow the list)

1. **Image Modularization Strategy:** Divides native-resolution images into smaller, variable-sized slices for efficient and extensible encoding.
2. **Compression Module:** Condenses the image tokens produced by the visual encoder to reduce the computational load on the LLM.
3. **Spatial Schema:** Organizes slice tokens so that the LLM can recover the spatial layout of the slices.
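To make the modularization strategy and spatial schema concrete, here is a minimal Python sketch. It follows the paper's high-level description but is not its code: it assumes a visual encoder pre-trained at 336×336 (as in the CLIP ViT used by LLaVA-1.5), searches candidate grids within one slice of the ideal count, and scores grids by how far each slice's aspect ratio deviates from the encoder's square pre-training ratio. The names `choose_grid`, `spatial_schema`, and the `<slice_r_c>` placeholders are illustrative.

```python
import math

# Illustrative constant: the paper slices images so each slice roughly matches
# the visual encoder's pre-training resolution, assumed here to be 336x336.
ENCODER_SIDE = 336

def choose_grid(width: int, height: int) -> tuple[int, int]:
    """Pick a (cols, rows) slice grid for an image of the given size."""
    # Ideal slice count: image area divided by the encoder's input area.
    ideal = math.ceil((width * height) / (ENCODER_SIDE ** 2))
    best, best_err = (1, 1), float("inf")
    # Consider grids whose slice count is within one of the ideal count.
    for n in range(max(1, ideal - 1), ideal + 2):
        for cols in range(1, n + 1):
            if n % cols:
                continue
            rows = n // cols
            # Score: log-space deviation of each slice's aspect ratio from
            # the encoder's square (1:1) pre-training aspect ratio.
            err = abs(math.log((width / cols) / (height / rows)))
            if err < best_err:
                best, best_err = (cols, rows), err
    return best

def spatial_schema(cols: int, rows: int) -> str:
    """Join slice placeholders with ',' within a row and a newline between
    rows, mirroring the paper's schema for conveying slice positions."""
    return "\n".join(
        ",".join(f"<slice_{r}_{c}>" for c in range(cols)) for r in range(rows)
    )

if __name__ == "__main__":
    cols, rows = choose_grid(672, 1088)  # the resolution highlighted below
    print(f"grid: {cols} cols x {rows} rows")  # -> 2 cols x 3 rows
    print(spatial_schema(cols, rows))
```

Each chosen slice is then resized to roughly the encoder's input size; the paper handles the remaining modest aspect-ratio variation by 2D interpolation of the ViT position embeddings.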
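The paper implements the compression module with a shared perceiver-resampler layer. The PyTorch sketch below shows the general pattern, a fixed set of learnable queries cross-attending to a slice's patch tokens, so the compressed length is independent of slice size. The class name and all dimensions (64 queries, 1024-dim tokens, 8 heads) are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TokenCompressor(nn.Module):
    """Perceiver-resampler-style compression: learnable queries summarize a
    variable number of slice tokens into a fixed number of output tokens."""

    def __init__(self, dim: int = 1024, num_queries: int = 64, num_heads: int = 8):
        super().__init__()
        # Learnable query tokens; their count fixes the compressed length.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, slice_tokens: torch.Tensor) -> torch.Tensor:
        # slice_tokens: (batch, n_tokens, dim), e.g. 576 ViT patch tokens.
        b = slice_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Queries cross-attend to the slice tokens and summarize them.
        out, _ = self.attn(q, slice_tokens, slice_tokens)
        return self.norm(out)  # (batch, num_queries, dim)

# Usage: compress 576 patch tokens per slice down to 64 query tokens.
tokens = torch.randn(2, 576, 1024)
compressed = TokenCompressor()(tokens)
print(compressed.shape)  # torch.Size([2, 64, 1024])
```

Because every slice is compressed to the same fixed length, the token budget handed to the LLM grows linearly with the slice count rather than with raw pixel count.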
**Contributions:**

1. **Mechanistic Investigation:** Conducts the first mechanistic investigation of GPT-4V's flaws from the perspective of its visual encoding strategy.
2. **Model Proposal:** Presents LLaVA-UHD, a model that efficiently perceives images of any aspect ratio and high resolution.
3. **Experimental Validation:** Demonstrates the effectiveness of LLaVA-UHD on 9 benchmarks, with significant improvements over established LMMs trained with 2-3 orders of magnitude more data.

**Key Results:**

- LLaVA-UHD supports 672×1088-resolution images using only 94% of the inference computation of LLaVA-1.5 (which processes 336×336 images).
- Achieves a 6.4-point accuracy improvement on TextVQA and a 3.2-point accuracy improvement on POPE.
- Can be trained efficiently in academic settings, completing training within 23 hours on 8 A100 GPUs.

**Discussion:**

- The paper highlights the importance of adaptive and efficient visual encoding for processing high-resolution images.
- Future work will target higher-resolution images and more demanding tasks such as small-object detection and segmentation.
- Potential negative impacts, such as vulnerability to adversarial attacks, are also discussed, emphasizing the need for further research into robustness and safety.