GROUNDHOG: Grounding Large Language Models to Holistic Segmentation


16 Apr 2024 | Yichi Zhang, Ziqiao Ma, Xiaofeng Gao, Suhaila Shakiah, Qiaozhi Gao, Joyce Chai
GROUNDHOG is a multimodal large language model (MLLM) developed by grounding large language models to holistic segmentation: it enhances its text output with pixel-level phrase grounding across diverse semantic granularities. GROUNDHOG incorporates a masked feature extractor that extracts features for each entity mask proposal and converts them into visual entity tokens for the MLLM backbone; the backbone then connects groundable phrases to unified grounding masks by retrieving and merging the relevant entity masks.

The model is trained on M3G2, a multi-modal, multi-grained grounding dataset with 2.5 million text-image pairs derived from 27 existing datasets. M3G2 covers four task types: Grounded Image Captioning (GIC), Referential Expression Segmentation (RES), Grounded Visual Question Answering (GVQA), and Referential Dialogue (RD).

GROUNDHOG achieves superior performance on a variety of language grounding tasks without task-specific fine-tuning and significantly reduces object hallucination. It grounds language more reliably in complex forms of visual input and provides easy-to-understand diagnoses of failure cases. The model supports visual pointers as input and works plug-and-play with any choice of mask proposal network, such as the Segment Anything Model (SAM).
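To make the retrieve-and-merge grounding step concrete, below is a minimal PyTorch sketch of the pipeline as summarized above: mask proposals (which could come from SAM) are pooled into entity tokens, and a groundable phrase embedding produced by the language backbone is used to score the entities and merge the selected masks. All function names, dimensions, and the scoring rule are illustrative assumptions for this summary, not the paper's released implementation.

```python
import torch

# Minimal, hypothetical sketch of GROUNDHOG-style "retrieve and merge" grounding.
# Names, dimensions, and the scoring rule are illustrative, not the released code.

def masked_pool(feature_map, masks):
    """Pool image features inside each mask proposal into one entity feature per mask.

    feature_map: (C, H, W) image features (e.g. from CLIP / DINOv2 backbones)
    masks:       (N, H, W) binary mask proposals (e.g. produced by SAM)
    returns:     (N, C) per-entity features that become visual entity tokens
    """
    C, H, W = feature_map.shape
    flat_feats = feature_map.view(C, H * W)                    # (C, HW)
    flat_masks = masks.float().view(-1, H * W)                 # (N, HW)
    weights = flat_masks / flat_masks.sum(dim=1, keepdim=True).clamp(min=1e-6)
    return weights @ flat_feats.T                              # (N, C)


def ground_phrase(phrase_emb, entity_tokens, masks, threshold=0.5):
    """Select the entity masks relevant to a groundable phrase and merge them.

    phrase_emb:    (C,) embedding of a groundable phrase from the language backbone
    entity_tokens: (N, C) visual entity tokens
    masks:         (N, H, W) the corresponding binary mask proposals
    returns:       (H, W) unified grounding mask and the (N,) per-entity scores
    """
    scores = torch.sigmoid(entity_tokens @ phrase_emb)         # relevance of each entity
    selected = scores > threshold
    if selected.any():
        merged = masks[selected].any(dim=0)                    # union of selected masks
    else:
        merged = torch.zeros_like(masks[0])
    return merged, scores


if __name__ == "__main__":
    # Toy example: random tensors stand in for real image features, SAM mask
    # proposals, and a phrase embedding emitted by the MLLM backbone.
    C, H, W, N = 64, 32, 32, 5
    feature_map = torch.randn(C, H, W)
    masks = torch.rand(N, H, W) > 0.7
    entity_tokens = masked_pool(feature_map, masks)
    phrase_emb = torch.randn(C)
    grounding_mask, scores = ground_phrase(phrase_emb, entity_tokens, masks)
    print(grounding_mask.shape, scores)
```

Because the per-entity scores are explicit in a design like this, a failure case can be inspected by checking whether the mask proposals missed the region or whether the selection step scored the wrong entities, which is the diagnosability property discussed next.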
GROUNDHOG also enhances explainability and diagnosability through its decoupled design of entity proposal and selection. Evaluated on benchmarks including RefCOCO, PhraseCut, ReasonSeg, and TextVQA-X, it shows significant improvements over previous models. Its ability to generalize to pixel-level grounding and to handle complex visual tasks is demonstrated through experiments on grounded image captioning, language generation, and spatial prompt understanding.

Ablation studies further validate the design, showing that the combination of CLIP and DINOv2 features yields the best results. GROUNDHOG also reduces object hallucination by leveraging the varied task distribution and the negative question-answering samples in M3G2. Its effectiveness extends to occluded objects, groups of multiple instances, amorphous background regions, semantic parts of objects, and objects with irregular shapes.
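The ablation finding above (CLIP plus DINOv2 features working best) amounts to fusing features from two frozen vision backbones before they are turned into entity tokens. The sketch below illustrates one simple way to do this, channel-wise concatenation followed by a linear projection, with stand-in tensors in place of the real encoders; the fusion recipe, module names, and dimensions are assumptions for illustration rather than the paper's exact design.

```python
import torch
import torch.nn as nn

# Hedged sketch of fusing two frozen vision backbones (e.g. CLIP and DINOv2)
# before masked pooling; the encoder outputs below are stand-in tensors and the
# fusion recipe is an assumption for illustration.

class FusedMaskedFeatureExtractor(nn.Module):
    def __init__(self, clip_dim=1024, dino_dim=768, out_dim=512):
        super().__init__()
        # Project concatenated per-pixel features into a shared entity-token space.
        self.proj = nn.Linear(clip_dim + dino_dim, out_dim)

    def forward(self, clip_feats, dino_feats, masks):
        """
        clip_feats: (H, W, C1) spatial features from a CLIP-style encoder
        dino_feats: (H, W, C2) spatial features from a DINOv2-style encoder
        masks:      (N, H, W)  binary mask proposals
        returns:    (N, out_dim) one fused entity token per mask
        """
        fused = torch.cat([clip_feats, dino_feats], dim=-1)            # (H, W, C1+C2)
        flat = fused.view(-1, fused.shape[-1])                         # (HW, C1+C2)
        flat_masks = masks.float().view(masks.shape[0], -1)            # (N, HW)
        weights = flat_masks / flat_masks.sum(dim=1, keepdim=True).clamp(min=1e-6)
        pooled = weights @ flat                                        # (N, C1+C2)
        return self.proj(pooled)                                       # (N, out_dim)


if __name__ == "__main__":
    H, W, N = 32, 32, 4
    extractor = FusedMaskedFeatureExtractor()
    clip_feats = torch.randn(H, W, 1024)   # stand-in for CLIP spatial features
    dino_feats = torch.randn(H, W, 768)    # stand-in for DINOv2 spatial features
    masks = torch.rand(N, H, W) > 0.6
    entity_tokens = extractor(clip_feats, dino_feats, masks)
    print(entity_tokens.shape)             # torch.Size([4, 512])
```

In a setup like this, the two vision backbones would typically stay frozen, with only the projection trained jointly alongside the MLLM.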