Grounded Language-Image Pre-training


17 Jun 2022 | Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, Jianfeng Gao
This paper introduces GLIP, a grounded language-image pre-training model that learns object-level, language-aware, semantic-rich visual representations by unifying object detection and phrase grounding. GLIP leverages massive image-text pairs to generate grounding boxes in a self-training fashion, and is pre-trained on 27M grounding examples: 3M human-annotated pairs and 24M web-crawled image-text pairs.

The learned representations transfer strongly to a wide range of object-level recognition tasks in both zero-shot and few-shot settings. GLIP reaches 49.8 AP on COCO and 26.9 AP on LVIS without seeing any images from either dataset during pre-training, surpassing many supervised baselines. After fine-tuning on COCO, it reaches 60.8 AP on val and 61.5 AP on test-dev, surpassing prior state-of-the-art models. When transferred to 13 downstream object detection tasks, a 1-shot GLIP rivals a fully supervised Dynamic Head baseline. GLIP also supports prompt tuning, which matches the performance of full fine-tuning while updating only a small fraction of the model's parameters.

The model's language-aware deep fusion and its use of grounding data significantly improve transferability to downstream tasks, with strong performance on rare categories and diverse real-world benchmarks. Its ability to transfer to new tasks with minimal additional data or annotations makes GLIP a promising approach for object detection in the wild.
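The core idea behind unifying detection and grounding is to feed the detector a text prompt (for example, the concatenated class names of a detection dataset) and to replace per-class classification logits with alignment scores between region features and the prompt's token embeddings. The sketch below is a minimal, illustrative reconstruction of that scoring step, not the official GLIP implementation; the module and variable names (GroundingHead, region_feats, token_feats) and all dimensions are assumptions chosen for clarity.

```python
# Minimal sketch of detection-as-grounding scoring (illustrative, not the official GLIP code).
# Region features come from a visual backbone/detector; token features come from a language
# encoder run over a prompt such as "person. bicycle. car. ..." for COCO-style detection.
import torch
import torch.nn as nn

class GroundingHead(nn.Module):
    def __init__(self, vis_dim: int, txt_dim: int, proj_dim: int = 256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, proj_dim)   # project region features
        self.txt_proj = nn.Linear(txt_dim, proj_dim)   # project prompt token features

    def forward(self, region_feats: torch.Tensor, token_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: (num_regions, vis_dim), token_feats: (num_tokens, txt_dim)
        O = self.vis_proj(region_feats)                # (num_regions, proj_dim)
        P = self.txt_proj(token_feats)                 # (num_tokens,  proj_dim)
        # Region-token alignment scores play the role of classification logits:
        # each candidate region is scored against every token of the text prompt.
        return O @ P.t()                               # (num_regions, num_tokens)

# Usage with dummy tensors standing in for encoder outputs:
head = GroundingHead(vis_dim=256, txt_dim=768)
regions = torch.randn(100, 256)                        # e.g. 100 proposals/anchors
tokens = torch.randn(30, 768)                          # e.g. 30 prompt tokens
scores = head(regions, tokens)                         # (100, 30) alignment logits
```

Because the "label space" is now just the text prompt, the same head serves both detection (class names as the prompt) and phrase grounding (a free-form caption as the prompt), which is what lets GLIP train on both annotation types and transfer to new category sets by simply changing the prompt.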