5 Mar 2024 | Zhaowei Li, Qi Xu, Dong Zhang, Hang Song, Yiqing Cai, Qi Qi, Ran Zhou, Junting Pan, Zefeng Li, Van Tu Vu, Zhida Huang, Tao Wang
GroundingGPT is a language-enhanced multi-modal grounding model designed to perform fine-grained grounding tasks across image, video, and audio modalities. The model introduces modality-specific adapters that map feature representations from each encoder into the LLM embedding space, and it represents spatial coordinates and timestamps as plain textual numbers, avoiding any vocabulary expansion. Training follows a three-stage coarse-to-fine strategy (multi-modal pre-training, fine-grained alignment tuning, and multi-granularity instruction tuning), supported by a diversified dataset construction pipeline that produces a multi-modal, multi-granularity training dataset to strengthen both semantic awareness and fine-grained understanding. Extensive experiments on multiple benchmarks show that GroundingGPT achieves state-of-the-art performance on multi-modal grounding and understanding tasks across image, video, and audio, while maintaining or improving global comprehension, performing strongly on object localization, and effectively suppressing object hallucination. The model is made publicly available for further research.
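To make the two architectural ideas above concrete, here is a minimal sketch (not the authors' released code) assuming PyTorch and hypothetical dimensions: a modality-specific adapter that projects encoder features into the LLM embedding space, and a helper that writes bounding-box coordinates as normalized textual numbers so no new tokens need to be added to the vocabulary; timestamps can be rendered analogously as plain numbers.

```python
# Minimal sketch of a modality adapter and textual coordinate formatting.
# The exact adapter architecture and normalization scheme are assumptions,
# not the authors' implementation.
import torch
import torch.nn as nn


class ModalityAdapter(nn.Module):
    """Projects frozen-encoder features (image, video, or audio) to the LLM width."""

    def __init__(self, encoder_dim: int, llm_dim: int):
        super().__init__()
        # A simple two-layer MLP; the real adapter design may differ.
        self.proj = nn.Sequential(
            nn.Linear(encoder_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, num_tokens, encoder_dim) -> (batch, num_tokens, llm_dim)
        return self.proj(features)


def box_to_text(x1: float, y1: float, x2: float, y2: float,
                width: int, height: int) -> str:
    """Render a bounding box as normalized textual coordinates, e.g. '[0.12,0.30,0.58,0.91]'."""
    return f"[{x1 / width:.2f},{y1 / height:.2f},{x2 / width:.2f},{y2 / height:.2f}]"


if __name__ == "__main__":
    adapter = ModalityAdapter(encoder_dim=1024, llm_dim=4096)
    vision_tokens = torch.randn(1, 256, 1024)            # e.g. patch features from a vision encoder
    print(adapter(vision_tokens).shape)                   # torch.Size([1, 256, 4096])
    print(box_to_text(60, 150, 290, 455, width=500, height=500))  # [0.12,0.30,0.58,0.91]
```

Because the coordinates are emitted as ordinary digit strings, the LLM can read and generate them with its existing tokenizer, which is what allows grounding outputs without expanding the vocabulary.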