5 Mar 2024 | Zhaowei Li, Qi Xu, Dong Zhang, Hang Song, Yiqing Cai, Qi Qi, Ran Zhou, Junting Pan, Zefeng Li, Van Tu Vu, Zhida Huang, Tao Wang
GroundingGPT is an end-to-end language-enhanced multi-modal grounding model designed to perform fine-grained grounding tasks across image, video, and audio modalities. The model employs a coarse-to-fine training strategy consisting of three stages: multi-modal pre-training, fine-grained alignment tuning, and multi-granularity instruction tuning. Each stage uses a stage-specific dataset to enhance the model's semantic awareness and fine-grained understanding. Extensive experiments on various benchmarks demonstrate that GroundingGPT achieves strong fine-grained understanding of multi-modal inputs while maintaining or improving global comprehension, with superior results on image grounding, video grounding, and multi-modal understanding tasks compared to existing models. The code, dataset, and model are publicly available to facilitate further research.
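To make the coarse-to-fine schedule concrete, here is a minimal, hypothetical Python sketch of a three-stage training pipeline driven by stage-specific data. The Stage type, run_stage helper, and dataset descriptions are illustrative assumptions for exposition, not GroundingGPT's actual training code.

```python
# Hypothetical sketch of a coarse-to-fine, three-stage training schedule.
# Stage names follow the abstract; everything else is a placeholder.

from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Stage:
    name: str
    dataset: Iterable[str]  # stands in for the stage-specific dataset


def run_stage(stage: Stage) -> None:
    # In a real pipeline, each stage would update the model's parameters
    # on its own data; here we only iterate to show the control flow.
    for sample in stage.dataset:
        pass  # forward/backward pass would go here


stages: List[Stage] = [
    Stage("multi-modal pre-training", ["coarse image/video/audio-text pairs"]),
    Stage("fine-grained alignment tuning", ["region- and timestamp-grounded pairs"]),
    Stage("multi-granularity instruction tuning", ["mixed-granularity instructions"]),
]

# Stages run in order, moving from coarse semantic alignment to
# fine-grained grounding, as the abstract describes.
for stage in stages:
    run_stage(stage)
```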