OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

27 Jun 2024 | Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Chen Change Loy, Shuicheng Yan
OMG-LLaVA is a framework that unifies image-level, object-level, and pixel-level reasoning and understanding in a single model. It pairs strong pixel-level visual understanding with the reasoning abilities of an LLM and accepts a variety of visual and text prompts for flexible user interaction.

The system design is simple and elegant: one visual encoder, one decoder, and one LLM. A universal segmentation model serves as the visual encoder, encoding image information, perception priors, and visual prompts into visual tokens that are passed to the LLM. The LLM interprets the user's text instructions and, based on this visual information, produces both text responses and pixel-level segmentation results. A perception prior embedding is proposed to better integrate the perception priors with the image features.
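To make the token flow concrete, here is a minimal PyTorch-style sketch of the pipeline described above. The module names, tensor dimensions, and the mask-weighted pooling used for the perception prior embedding are assumptions made for illustration, not the released implementation.

```python
import torch
import torch.nn as nn


class PerceptionPriorEmbedding(nn.Module):
    """Fuse pixel features with object queries via mask-weighted pooling
    (an assumed fusion scheme, for illustration only)."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, pixel_feats, object_queries, mask_scores):
        # pixel_feats:    (B, N_pix, C)      per-pixel features from the frozen segmentation encoder
        # object_queries: (B, N_obj, C)      decoder queries for the segmented objects
        # mask_scores:    (B, N_obj, N_pix)  soft assignment of pixels to objects
        weights = mask_scores.softmax(dim=-1)        # normalize over pixels
        pooled = torch.bmm(weights, pixel_feats)     # (B, N_obj, C) mask-pooled features
        return object_queries + self.proj(pooled)    # object tokens enriched with perception priors


class OMGLLaVASketch(nn.Module):
    """One encoder, one decoder, one LLM -- reduced here to the tensor plumbing."""

    def __init__(self, vis_dim: int = 256, llm_dim: int = 4096):
        super().__init__()
        self.prior_embed = PerceptionPriorEmbedding(vis_dim)
        self.visual_projector = nn.Linear(vis_dim, llm_dim)  # visual tokens -> LLM embedding space
        self.text_projector = nn.Linear(llm_dim, vis_dim)    # [SEG] hidden state -> mask query

    def forward(self, pixel_feats, object_queries, mask_scores, seg_hidden):
        # 1) Build pixel-centric and object-centric visual tokens for the LLM.
        obj_tokens = self.prior_embed(pixel_feats, object_queries, mask_scores)
        visual_tokens = self.visual_projector(torch.cat([pixel_feats, obj_tokens], dim=1))
        # 2) The LLM (not instantiated here) would consume visual_tokens plus text tokens
        #    and emit a [SEG] token; its hidden state is projected back and decoded into masks.
        mask_query = self.text_projector(seg_hidden)                          # (B, 1, vis_dim)
        mask_logits = torch.einsum("bqc,bnc->bqn", mask_query, pixel_feats)   # (B, 1, N_pix)
        return visual_tokens, mask_logits
```

Calling the sketch with tensors of matching shapes yields the LLM-ready visual tokens and mask logits for one [SEG] query; in the real system, the frozen universal segmentation encoder/decoder and the LLM supply those tensors.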
With this single model, OMG-LLaVA supports a wide range of tasks, including image captioning, image-based conversation, region captioning, visual prompt-based conversation, referring segmentation, reasoning segmentation, and grounded conversation generation. On multiple benchmarks, including referring expression segmentation and grounded conversation generation, it matches or surpasses specialized methods and outperforms other MLLMs in pixel-level and object-level understanding and reasoning. The code and model have been released for further research.

Training follows a two-stage process: pretraining and instruction tuning. In the pretraining stage, the perception model and the LLM are frozen; the instruction tuning stage then fine-tunes the visual and text projectors.
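The staged freezing described above can be expressed as a small helper. The attribute names (perception_model, llm, visual_projector, text_projector), the optimizer choice, and the learning rate are placeholders assumed for illustration; only the freeze/train split mirrors the two-stage recipe.

```python
import torch


def set_trainable(module: torch.nn.Module, flag: bool) -> None:
    """Enable or disable gradients for every parameter of a submodule."""
    for param in module.parameters():
        param.requires_grad = flag


def configure_stage(model: torch.nn.Module, stage: str) -> torch.optim.Optimizer:
    """Return an optimizer over the parameters that are trainable in the given stage."""
    if stage == "pretrain":
        # Stage 1: perception model and LLM stay frozen; only the projectors are trained.
        set_trainable(model.perception_model, False)
        set_trainable(model.llm, False)
        set_trainable(model.visual_projector, True)
        set_trainable(model.text_projector, True)
    elif stage == "instruction_tuning":
        # Stage 2: fine-tune the visual and text projectors (the perception model remains
        # frozen; any additional LLM tuning would follow the released configuration).
        set_trainable(model.perception_model, False)
        set_trainable(model.visual_projector, True)
        set_trainable(model.text_projector, True)
    else:
        raise ValueError(f"unknown stage: {stage}")
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=2e-5)  # placeholder learning rate
```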