Grounded Language-Image Pre-training

17 Jun 2022 | Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, Jianfeng Gao
This paper introduces Grounded Language-Image Pre-training (GLIP), a model that learns object-level, language-aware, and semantic-rich visual representations by unifying object detection and phrase grounding. GLIP leverages both detection and grounding data to improve performance on both tasks and to bootstrap a strong grounding model. By generating grounding boxes for web-crawled image-text pairs, GLIP can be pre-trained on massive image-text data, yielding semantic-rich representations. Experiments show that GLIP transfers well to various object-level recognition tasks in zero-shot and few-shot settings, outperforming many supervised baselines and state-of-the-art models. GLIP's unified formulation and deep cross-modality fusion facilitate domain transfer, allowing it to be applied to diverse downstream tasks with minimal fine-tuning. The paper also discusses the benefits of grounding data and demonstrates the effectiveness of prompt tuning for efficient deployment.
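To make the unified formulation concrete, the sketch below shows the core idea of reformulating detection as grounding: instead of predicting fixed class logits, region features are scored against token features of a text prompt (e.g. "person. bicycle. hair dryer."). This is a minimal illustration, not the paper's implementation; the class name GroundingHead, the projection layers, and the feature dimensions are assumptions chosen for readability, and the actual GLIP model additionally applies deep cross-modality fusion between the image and text encoders before this matching step.

```python
import torch
import torch.nn as nn


class GroundingHead(nn.Module):
    """Toy region-word alignment head in the spirit of GLIP's unified formulation.

    Classification logits are replaced by alignment scores between visual
    region features and token features of a text prompt. Dimensions and
    projection layers here are illustrative, not the paper's exact design.
    """

    def __init__(self, vis_dim=256, txt_dim=768, joint_dim=256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, joint_dim)  # project region features
        self.txt_proj = nn.Linear(txt_dim, joint_dim)  # project prompt token features

    def forward(self, region_feats, token_feats):
        # region_feats: (num_regions, vis_dim), e.g. from a detector's box head
        # token_feats:  (num_tokens, txt_dim),  e.g. from a BERT-style text encoder
        O = self.vis_proj(region_feats)                # (num_regions, joint_dim)
        P = self.txt_proj(token_feats)                 # (num_tokens, joint_dim)
        # Entry (i, j) scores how well region i matches prompt token j;
        # these alignment scores play the role of classification logits.
        return O @ P.t()                               # (num_regions, num_tokens)


if __name__ == "__main__":
    head = GroundingHead()
    regions = torch.randn(100, 256)   # dummy region features
    tokens = torch.randn(16, 768)     # dummy prompt token features
    scores = head(regions, tokens)
    print(scores.shape)               # torch.Size([100, 16])
```

Because the "label space" is just the text prompt, swapping in new category names or descriptions changes the detector's vocabulary without retraining, which is what enables the zero-shot transfer and prompt tuning discussed above.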