LLaVA-UHD is a large multimodal model that efficiently perceives images in any aspect ratio and at high resolution. It addresses the limitations of existing large multimodal models (LMMs), which process images at fixed sizes and limited resolutions and consequently suffer from shape distortion, blur, and hallucination caused by rigid encoding strategies. LLaVA-UHD introduces three key components: (1) an image modularization strategy that divides native-resolution images into smaller, variable-sized slices for efficient encoding, (2) a compression module that condenses the image tokens produced by the visual encoder, and (3) a spatial schema that organizes slice tokens for the LLM.

Comprehensive experiments show that LLaVA-UHD outperforms established LMMs on nine benchmarks, with notable accuracy gains on tasks such as TextVQA and POPE. In particular, the model supports 672×1088-resolution images using only 94% of the inference computation, and it can be trained in academic settings in 23 hours on 8 A100 GPUs, compared to 26 hours for LLaVA-1.5.

The work also identifies systematic flaws in the visual encoding strategies of GPT-4V and LLaVA-1.5, highlighting the need for more adaptive and efficient visual encoding methods. By handling images with varied aspect ratios and high resolutions without padding or shape-distorting resizing, LLaVA-UHD's modularized visual encoding improves performance while reducing computational cost. Its native-aspect-ratio, high-resolution processing, together with its efficient compression and spatial organization of image tokens, makes it a significant advance in large multimodal models.
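To make the modularization idea concrete, below is a minimal Python sketch of how a native-resolution image could be divided into variable-sized slices before visual encoding. The encoder input size `BASE`, the `max_slices` budget, and the grid-selection heuristic are illustrative assumptions, not the paper's exact algorithm; in LLaVA-UHD the resulting slice tokens are further condensed by the compression module and arranged by the spatial schema before being passed to the LLM.

```python
import math
from typing import List, Tuple

from PIL import Image

# Illustrative sketch of LLaVA-UHD-style image modularization.
# Assumptions (not taken verbatim from the paper): the visual encoder
# expects square BASE x BASE inputs, and the slice budget `max_slices`
# and grid-selection heuristic are simplified placeholders.
BASE = 336


def choose_grid(width: int, height: int, max_slices: int = 6) -> Tuple[int, int]:
    """Pick a (cols, rows) grid whose slices stay close to the encoder's
    input shape, so per-slice resizing causes little distortion."""
    # Scale the slice count with image area relative to the encoder's
    # native input area (a simplifying assumption for illustration).
    target = max(1, min(max_slices, math.ceil(width * height / BASE**2)))
    best, best_dev = (1, target), float("inf")
    for cols in range(1, target + 1):
        if target % cols:
            continue
        rows = target // cols
        # Deviation of each slice's aspect ratio from the square input.
        dev = abs((width / cols) / (height / rows) - 1.0)
        if dev < best_dev:
            best, best_dev = (cols, rows), dev
    return best


def slice_image(img: Image.Image, max_slices: int = 6) -> List[Image.Image]:
    """Divide a native-resolution image into variable-sized slices and
    resize each slice to the encoder input size only after slicing."""
    cols, rows = choose_grid(img.width, img.height, max_slices)
    slice_w, slice_h = img.width // cols, img.height // rows
    slices = []
    for r in range(rows):       # row-major order matches a simple
        for c in range(cols):   # row/column spatial schema downstream
            box = (c * slice_w, r * slice_h, (c + 1) * slice_w, (r + 1) * slice_h)
            slices.append(img.crop(box).resize((BASE, BASE)))
    return slices


if __name__ == "__main__":
    img = Image.new("RGB", (672, 1088))   # tall example image
    print(choose_grid(672, 1088))         # -> (2, 3): 2 columns x 3 rows
    print(len(slice_image(img)))          # -> 6 slices, each 336x336
```

In this sketch, a 672×1088 image yields a 2×3 grid, so each slice stays close to the encoder's square input without padding the whole image or resizing it in a shape-distorting way; only the per-slice resize at the end touches pixel geometry.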