This paper proposes ViLD, a method for open-vocabulary object detection that leverages knowledge distillation from a pretrained open-vocabulary image classification model. The method distills knowledge from a teacher model (e.g., CLIP or ALIGN) into a student detector, enabling the student to detect objects described by arbitrary text inputs. The key idea is to align the region embeddings of detected boxes with the text and image embeddings inferred by the teacher model. ViLD is built on a two-stage detector: the first stage generates region proposals, and the second stage classifies these proposals using the teacher model's knowledge. The method is evaluated on the LVIS dataset, where it achieves 16.1 mask AP$_r$ with a ResNet-50 backbone, outperforming the supervised counterpart by 3.8. When trained with the stronger teacher ALIGN, ViLD achieves 26.3 AP$_r$, close to the 2020 LVIS Challenge winner. ViLD transfers directly to other datasets without fine-tuning, achieving 72.2 AP$_{50}$ on PASCAL VOC, 36.6 AP on COCO, and 11.8 AP on Objects365. On COCO, ViLD outperforms the previous state of the art by 4.8 on novel AP and 11.4 on overall AP. The method is efficient and generalizable, as it works with off-the-shelf open-vocabulary image classifiers. The code and demo are open-sourced at https://github.com/tensorflow/tpu/tree/master/models/official/detection/projects/vild.
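
To make the alignment idea concrete, below is a minimal sketch (in PyTorch, not the paper's released TensorFlow code) of the two losses this summary describes: classifying region embeddings against category text embeddings, and distilling CLIP image embeddings of cropped proposals into those region embeddings. All names (`region_embeds`, `text_embeds`, `clip_image_embeds`, the temperature value) are illustrative assumptions, not identifiers from the official implementation.

```python
# Hypothetical sketch of ViLD-style training losses; tensor names and the
# temperature are assumptions, not the paper's code.
import torch
import torch.nn.functional as F


def vild_text_loss(region_embeds, text_embeds, labels, temperature=0.01):
    """Classify detector region embeddings against category text embeddings.

    region_embeds: (N, D) embeddings of proposals from the detector head.
    text_embeds:   (C, D) teacher text embeddings of category names.
    labels:        (N,) ground-truth class indices (base categories).
    """
    region_embeds = F.normalize(region_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # Cosine-similarity logits, scaled by a temperature, then cross-entropy.
    logits = region_embeds @ text_embeds.t() / temperature
    return F.cross_entropy(logits, labels)


def vild_image_distill_loss(region_embeds, clip_image_embeds):
    """Pull region embeddings toward the teacher's embeddings of cropped proposals.

    clip_image_embeds: (M, D) teacher image embeddings of cropped region proposals.
    """
    return F.l1_loss(F.normalize(region_embeds, dim=-1),
                     F.normalize(clip_image_embeds, dim=-1))
```

At inference time, the same cosine-similarity scoring can be run against text embeddings of an arbitrary category list, which is what allows the detector to handle novel classes without retraining.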