Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models

19 Apr 2024 | Chuofan Ma, Yi Jiang, Jiannan Wu, Zehuan Yuan, Xiaojuan Qi
Groma is a Multimodal Large Language Model (MLLM) with grounded and fine-grained visual perception. It supports region-level tasks such as region captioning and visual grounding through a localized visual tokenization mechanism: an image is decomposed into regions of interest, which are then encoded into region tokens. These tokens are integrated into user instructions and model responses, allowing Groma to understand user-specified region inputs and to ground its textual output in the image. A visually grounded instruction dataset was curated to enhance Groma's grounded chat ability. Compared with MLLMs that rely on the language model or on external modules for localization, Groma consistently performs better on standard referring and grounding benchmarks.

Groma's design integrates localization into image tokenization, enabling it to handle high-resolution inputs efficiently. Region tokenization runs alongside standard image tokenization to identify potential regions of interest and encode them into region tokens. The tokenizer can also encode user-specified region inputs into region tokens, which are inserted directly into user instructions to initiate referential dialogue.
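This interleaving of region tokens with ordinary text can be pictured with a minimal sketch. The token format `<rN>` and both helper functions below are assumptions made for illustration, not Groma's actual implementation; in the real model, each region token stands in for an embedding produced by the region encoder rather than a literal text placeholder.

```python
import re

# Illustrative sketch only: the "<rN>" token format and both helpers are
# assumptions for this example, not Groma's actual tokenizer.

def build_referring_prompt(instruction: str, region_ids: list[int]) -> str:
    """Insert placeholder region tokens for user-specified regions (referring input)."""
    region_tokens = " ".join(f"<r{i}>" for i in region_ids)
    return f"{instruction} Regions of interest: {region_tokens}"

def extract_grounded_regions(response: str) -> list[int]:
    """Recover region-token indices referenced in a grounded response (grounded output)."""
    return [int(m) for m in re.findall(r"<r(\d+)>", response)]

if __name__ == "__main__":
    prompt = build_referring_prompt("Describe the highlighted objects.", [3, 7])
    print(prompt)  # Describe the highlighted objects. Regions of interest: <r3> <r7>

    answer = "A dog <r3> is chasing a ball <r7> on the grass."
    print(extract_grounded_regions(answer))  # [3, 7]
```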
Groma performs strongly across benchmarks, including referring expression comprehension, region captioning, and conversational VQA, and it also demonstrates solid image-level understanding and reasoning. Its design handles multiple, diverse, and variably sized objects, as shown by its results on the LVIS-Ground benchmark, and its ability to generate long-form, grounded, and logically rich answers is attributed to the use of Groma Instruct data during finetuning.

The architecture comprises an image encoder, a region proposer, a region encoder, and a large language model. The image encoder uses a pretrained DINOv2 model, the region proposer is implemented as a class-agnostic detector head, the region encoder translates region proposals into region tokens, and the language model models the multimodal input and output. Training proceeds in three stages: detection pretraining, alignment pretraining, and instruction finetuning. Evaluation across a range of benchmarks demonstrates superior localization and grounding capabilities. By decoupling localization from understanding, Groma achieves robust and precise localization without relying on external modules, and performance is further enhanced by the visually grounded instruction dataset and a unified refer-and-ground formulation. Support for free-form region inputs and pixel-level grounding remains under development.
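To make the component flow concrete, here is a minimal sketch of the pipeline described above (image encoder, region proposer, region encoder, language-model input). Every module and dimension here is a placeholder assumption; Groma itself uses a pretrained DINOv2 encoder, a class-agnostic detector head, and a large language model, none of which are reproduced in this toy code.

```python
import torch
import torch.nn as nn

# Toy stand-in for the pipeline: all sizes and modules are placeholder
# assumptions, not Groma's actual components.

class ToyGromaPipeline(nn.Module):
    def __init__(self, img_dim=768, llm_dim=4096, num_proposals=16):
        super().__init__()
        self.num_proposals = num_proposals
        # Stand-in for the DINOv2 image encoder.
        self.image_encoder = nn.Linear(3 * 16 * 16, img_dim)
        # Stand-in for the class-agnostic region proposer: predicts one
        # (x, y, w, h) box per proposal from a pooled image feature.
        self.region_proposer = nn.Linear(img_dim, num_proposals * 4)
        # Region encoder: maps each region feature to a region token in
        # the language model's embedding space.
        self.region_encoder = nn.Linear(img_dim, llm_dim)
        # Projector for global image tokens into the same space.
        self.image_projector = nn.Linear(img_dim, llm_dim)

    def forward(self, patches: torch.Tensor):
        # patches: (num_patches, 3*16*16) flattened image patches
        feats = self.image_encoder(patches)                # (P, img_dim)
        pooled = feats.mean(dim=0)                         # (img_dim,)
        boxes = self.region_proposer(pooled).view(-1, 4)   # (N, 4) proposals
        # The real model pools features inside each proposed box; here we
        # simply reuse the global feature for every proposal.
        region_feats = pooled.expand(self.num_proposals, -1)
        region_tokens = self.region_encoder(region_feats)  # (N, llm_dim)
        image_tokens = self.image_projector(feats)         # (P, llm_dim)
        # Image tokens, region tokens, and text tokens would then be
        # concatenated and fed to the language model.
        return image_tokens, region_tokens, boxes

if __name__ == "__main__":
    model = ToyGromaPipeline()
    dummy_patches = torch.randn(196, 3 * 16 * 16)  # 14x14 patches of a 224px image
    img_tok, reg_tok, boxes = model(dummy_patches)
    print(img_tok.shape, reg_tok.shape, boxes.shape)
```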