27 Jun 2024 | Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Chen Change Loy, Shuicheng Yan
OMG-LLaVA is a novel multimodal large language model (MLLM) that integrates image-level, object-level, and pixel-level reasoning and understanding capabilities into a single framework. The model combines a universal segmentation method as the visual encoder with a large language model (LLM) to process text instructions and generate responses and pixel-level segmentation results. Key contributions include the use of perception prior embedding to integrate perception priors with image features, and a unified instruction formation strategy to handle various inputs such as visual images, texts, and visual prompts. OMG-LLaVA achieves state-of-the-art performance on multiple benchmarks, including COCO panoptic segmentation, VIPSeg video panoptic segmentation, refCOCO, refCOCO+, and refCOCOg referring expression segmentation, GranDf grounded conversation generation, and refCOCOg region captioning. The model's architecture is designed to be simple and elegant, consisting of only one visual encoder, one LLM, and one decoder, making it a more efficient and flexible baseline for MLLM design.
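To make the described composition concrete, here is a minimal, hypothetical PyTorch-style sketch of the data flow summarized above (visual encoder → perception prior embedding → LLM → decoder). The class and argument names are illustrative assumptions for this summary, not the authors' released API, and the real OMG-LLaVA implementation may differ in its details.

```python
# Hypothetical sketch of the OMG-LLaVA pipeline described above.
# All class and argument names are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn


class OMGLLaVASketch(nn.Module):
    """One visual encoder + one LLM + one decoder, wired as the summary describes."""

    def __init__(self, visual_encoder, perception_prior_embed, llm, mask_decoder):
        super().__init__()
        self.visual_encoder = visual_encoder            # universal segmentation backbone
        self.perception_prior_embed = perception_prior_embed  # fuses perception priors with image features
        self.llm = llm                                  # large language model
        self.mask_decoder = mask_decoder                # produces pixel-level segmentation masks

    def forward(self, image, text_tokens, visual_prompt_tokens=None):
        # 1. The universal segmentation encoder yields image features plus
        #    perception priors (e.g. object queries / mask proposals).
        image_feats, perception_priors = self.visual_encoder(image)

        # 2. Perception prior embedding: integrate the priors with the image
        #    features to form object-centric visual tokens for the LLM.
        visual_tokens = self.perception_prior_embed(image_feats, perception_priors)

        # 3. Unified instruction formation: concatenate visual tokens, optional
        #    visual-prompt tokens, and text tokens into one LLM input sequence.
        parts = [visual_tokens]
        if visual_prompt_tokens is not None:
            parts.append(visual_prompt_tokens)
        parts.append(text_tokens)
        llm_input = torch.cat(parts, dim=1)

        # 4. The LLM emits a text response together with segmentation-related
        #    token embeddings (an assumption here); the decoder turns those
        #    tokens into pixel-level masks conditioned on the image features.
        text_output, seg_tokens = self.llm(llm_input)
        masks = self.mask_decoder(seg_tokens, image_feats)
        return text_output, masks
```

The point of the sketch is only to show why the design counts as "simple and elegant": every capability (image-level answers, object-level prompts, pixel-level masks) is routed through the same three components rather than through additional task-specific modules.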