**GROUNDHOG 🐶: Grounding Large Language Models to Holistic Segmentation**
**Authors:** Yichi Zhang, Ziqiao Ma, Xiaofeng Gao, Suhaila Shakiah, Qiaozi Gao, Joyce Chai
**Institution:** University of Michigan, Amazon AGI
**Abstract:**
Most multimodal large language models (MLLMs) learn language-to-object grounding through causal language modeling, representing objects as bounding boxes encoded as sequences of location tokens. However, this approach lacks pixel-level representations, leading to ambiguity and interpretability issues. This paper introduces GROUNDHOG, an MLLM that enhances text output with pixel-level phrase grounding across diverse semantic granularities. GROUNDHOG incorporates a masked feature extractor and converts the extracted features into visual entity tokens for the MLLM backbone, producing unified grounding masks by retrieving and merging entity masks. The model is trained on the M3G2 dataset, curated from 27 existing datasets, and achieves strong performance on a range of language grounding tasks without task-specific fine-tuning. GROUNDHOG demonstrates stronger grounding on complex visual inputs and provides easy-to-understand diagnoses of failure cases.
**Key Contributions:**
1. **Pixel-Level Vision-Language Alignment:** GROUNDHOG achieves pixel-level vision-language alignment by incorporating a masked feature extractor and converting entity masks into visual entity tokens for the MLLM backbone (see the sketch after this list).
2. **Decoupled Design:** The model decouples entity mask proposal and language-guided grounding, allowing independent improvement of the mask proposal model and MLLM.
3. **Interpretability:** GROUNDHOG provides clear and interpretable diagnostics, making it easier to understand the grounding process and identify failures.
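The masked feature extraction behind contribution 1 can be illustrated with a short sketch. This is not the authors' implementation; the mask-pooling scheme, the `EntityTokenProjector` module, and all dimensions below are illustrative assumptions for how features pooled over class-agnostic masks could be turned into visual entity tokens for the MLLM backbone.

```python
import torch

def masked_feature_pooling(feature_map: torch.Tensor,
                           entity_masks: torch.Tensor) -> torch.Tensor:
    """Pool a dense feature map over binary entity masks.

    feature_map:  (C, H, W) features from a frozen vision backbone
                  (e.g. CLIP or DINOv2 patch features resized to H x W).
    entity_masks: (N, H, W) binary masks from a class-agnostic proposer.
    Returns:      (N, C) one feature vector per proposed entity.
    """
    n, h, w = entity_masks.shape
    masks = entity_masks.float().view(n, -1)               # (N, H*W)
    feats = feature_map.view(feature_map.shape[0], -1).T   # (H*W, C)
    # Average the backbone features inside each mask (avoid divide-by-zero).
    pooled = masks @ feats / masks.sum(dim=1, keepdim=True).clamp(min=1.0)
    return pooled                                           # (N, C)

class EntityTokenProjector(torch.nn.Module):
    """Hypothetical projector mapping pooled entity features into the
    MLLM embedding space, yielding one 'visual entity token' per mask."""
    def __init__(self, vis_dim: int, llm_dim: int):
        super().__init__()
        self.proj = torch.nn.Linear(vis_dim, llm_dim)

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        return self.proj(pooled)  # (N, llm_dim), fed to the language model

# Toy usage: 8 proposed masks over a 24x24 feature grid with 1024-d features.
feature_map = torch.randn(1024, 24, 24)
entity_masks = torch.rand(8, 24, 24) > 0.5
tokens = EntityTokenProjector(1024, 4096)(
    masked_feature_pooling(feature_map, entity_masks))
print(tokens.shape)  # torch.Size([8, 4096])
```

In this view, each proposed mask contributes exactly one token to the language model's input, which is what later lets the model refer back to individual entities when grounding phrases.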
**Methods:**
- **Entity Feature Extraction:** The masked feature extractor extracts features from class-agnostic entity masks using pretrained vision models like CLIP and DINOv2.
- **Grounding Tokens:** GROUNDHOG introduces grounding tokens `<GRD>` and `</GRD>` that mark the start and end of groundable phrases in the generated text, facilitating accurate grounding.
- **Mask Retrieval and Merging:** For each bracketed phrase, the model retrieves the relevant entity masks and merges them into a single grounding mask (see the sketch after this list).
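A minimal sketch of the retrieval-and-merging step is given below. It assumes the MLLM has already produced a per-entity retrieval score for the phrase wrapped in `<GRD>`/`</GRD>`; the `merge_retrieved_masks` helper, the threshold value, and the fallback rule are illustrative assumptions rather than details from the paper.

```python
import torch

GRD_OPEN, GRD_CLOSE = "<GRD>", "</GRD>"

def merge_retrieved_masks(entity_masks: torch.Tensor,
                          retrieval_scores: torch.Tensor,
                          threshold: float = 0.5) -> torch.Tensor:
    """Merge the masks of retrieved entities into one grounding mask.

    entity_masks:     (N, H, W) binary masks from the class-agnostic proposer.
    retrieval_scores: (N,) per-entity scores in [0, 1] assigned to the
                      phrase bracketed by <GRD> ... </GRD>.
    Returns:          (H, W) union of all masks whose score passes the threshold.
    """
    selected = retrieval_scores >= threshold
    if not selected.any():
        # Fall back to the single best-scoring entity.
        selected = retrieval_scores == retrieval_scores.max()
    return entity_masks[selected].any(dim=0)

# Toy usage: ground the phrase "the two dogs" against 5 proposed entities.
masks = torch.zeros(5, 4, 4, dtype=torch.bool)
masks[1, :2, :2] = True          # entity 1
masks[3, 2:, 2:] = True          # entity 3
scores = torch.tensor([0.05, 0.92, 0.10, 0.88, 0.03])
grounding_mask = merge_retrieved_masks(masks, scores)
response = f"I see {GRD_OPEN}the two dogs{GRD_CLOSE} playing in the yard."
print(response)
print(grounding_mask.int())      # union of the masks for entities 1 and 3
```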
**Dataset:**
- **M3G2:** A Multi-Modal Multi-Grained Grounding dataset consisting of 2.5M text-image pairs, derived from 27 existing datasets, covering 36 sub-problems.
**Experiments:**
- **Performance on Grounding Tasks:** GROUNDHOG outperforms existing models in various grounded vision-language tasks, including grounded language generation, language-guided segmentation, visual question answering, and referential dialogue.
**Conclusion:**
GROUNDHOG is a novel framework that enables pixel-level explainable grounding in large language models, leveraging holistic segmentation. It demonstrates superior performance and interpretability in various grounded vision-language tasks, addressing the limitations of existing models in handling complex visual inputs and providing clear diagnostics.