Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models


19 Apr 2024 | Chuofan Ma1*, Yi Jiang2†, Jiannan Wu1, Zehuan Yuan2, Xiaojuan Qi1†
Groma is a Multimodal Large Language Model (MLLM) designed for grounded and fine-grained visual perception. It introduces a localized visual tokenization mechanism that decomposes image inputs into regions of interest and encodes them into region tokens. This allows Groma to understand user-specified region inputs and to ground its textual output to images, improving its performance on region-level tasks such as region captioning and visual grounding. Groma's design decouples localization from high-level understanding by placing the spatial understanding capability in the visual tokenizer rather than the language model. To improve its grounded chat ability, a visually grounded instruction dataset is curated using GPT-4V and visual prompting techniques. Compared to other MLLMs, Groma demonstrates superior performance on standard referring and grounding benchmarks, highlighting the effectiveness of embedding localization into image tokenization. The project page is available at <https://groma-mllm.github.io/>.
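
Below is a minimal, hypothetical sketch of the localized visual tokenization idea described in the abstract: an image is decomposed into regions of interest, each region is pooled into a "region token", and the resulting tokens can be interleaved with text tokens in the language model's input. The module names, dimensions, and the use of `roi_align` are illustrative assumptions, not Groma's actual implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align


class LocalizedVisualTokenizer(nn.Module):
    """Sketch of a region-aware visual tokenizer (assumed design, not Groma's code)."""

    def __init__(self, feat_dim=256, llm_dim=4096):
        super().__init__()
        # Placeholder backbone producing a spatial feature map (assumption).
        self.backbone = nn.Conv2d(3, feat_dim, kernel_size=16, stride=16)
        # Projects pooled region features into the LLM embedding space.
        self.region_proj = nn.Linear(feat_dim, llm_dim)

    def forward(self, images, boxes):
        # images: (B, 3, H, W); boxes: list of (N_i, 4) tensors in image coordinates.
        feats = self.backbone(images)                         # (B, C, H/16, W/16)
        pooled = roi_align(feats, boxes, output_size=1,
                           spatial_scale=1.0 / 16)            # (sum N_i, C, 1, 1)
        region_tokens = self.region_proj(pooled.flatten(1))   # one token per region
        return region_tokens


# Usage: two user-specified regions on one image yield two region tokens that
# could be inserted into the LLM input sequence alongside text tokens.
tokenizer = LocalizedVisualTokenizer()
image = torch.randn(1, 3, 224, 224)
boxes = [torch.tensor([[10., 10., 100., 100.], [50., 60., 200., 180.]])]
tokens = tokenizer(image, boxes)   # shape: (2, 4096)
print(tokens.shape)
```

In the full model, the regions would come from a region proposer rather than being hand-specified, so the same tokenizer serves both user-referred region inputs and grounded (box-linked) text outputs.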