15 Mar 2024 | Chuang Lin, Yi Jiang, Lizhen Qu, Zehuan Yuan, Jianfei Cai
This paper introduces GenerateU, a novel approach to generative open-ended object detection. The goal is an open-world object detector that localizes all objects in an image and names them in free-form language. Because detection is formulated as a generative problem, no predefined category list is needed at inference. The model pairs an open-world object detector with a language model: it is first trained on a small set of human-annotated object-language pairs and then scaled up with massive image-text pairs, using a pseudo-labeling method to enrich label diversity.
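The summary does not spell out the pseudo-labeling procedure, but one common recipe is to match noun phrases from each paired caption to detected regions by embedding similarity and keep only confident matches as labels. Below is a minimal sketch under that assumption; `embed_region`, `embed_text`, and the threshold are illustrative placeholders, not the paper's actual implementation.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def pseudo_label(regions, noun_phrases, embed_region, embed_text, thresh=0.3):
    """Assign each detected region the best-matching caption noun phrase,
    keeping only matches above a confidence threshold."""
    phrase_vecs = [embed_text(p) for p in noun_phrases]
    labels = []
    for region in regions:
        r_vec = embed_region(region)
        sims = [cosine(r_vec, p_vec) for p_vec in phrase_vecs]
        best = int(np.argmax(sims))
        if sims[best] >= thresh:  # discard low-confidence region-phrase pairs
            labels.append((region, noun_phrases[best]))
    return labels

# Toy run with random stand-in embeddings (a real system would use a
# vision-language encoder such as CLIP).
rng = np.random.default_rng(0)
embed = lambda _: rng.normal(size=16)
print(pseudo_label(["box_1", "box_2"], ["a dog", "a frisbee"], embed, embed))
```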
GenerateU employs Deformable DETR as a region proposal generator and a language model that translates the resulting visual regions into object names. The two components are trained end-to-end, and a region-word alignment loss is added to help the model distinguish region features. Evaluated on the LVIS dataset, GenerateU achieves results comparable to the open-vocabulary object detection method GLIP, even though it never sees the category names during inference.
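As a rough illustration of this two-stage design, the sketch below wires a DETR-style query decoder (a compact stand-in for Deformable DETR) to produce region features and boxes; in the full model, each region feature would then be projected into a seq2seq language model that decodes an object name. All module sizes here are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class RegionProposer(nn.Module):
    """Compact stand-in for Deformable DETR: learned queries attend to image
    features and yield per-region features plus box predictions."""
    def __init__(self, num_queries=10, dim=256):
        super().__init__()
        self.backbone = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # toy patch embed
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.box_head = nn.Linear(dim, 4)  # (cx, cy, w, h) in [0, 1]

    def forward(self, images):                                    # (B, 3, H, W)
        feats = self.backbone(images).flatten(2).transpose(1, 2)  # (B, HW/256, dim)
        q = self.queries.unsqueeze(0).expand(images.size(0), -1, -1)
        region_feats = self.decoder(q, feats)                     # (B, N, dim)
        boxes = self.box_head(region_feats).sigmoid()             # (B, N, 4)
        return region_feats, boxes

# In the full model, each region feature would next be fed (through a
# projection) as encoder context to a language model generating the name.
proposer = RegionProposer()
region_feats, boxes = proposer(torch.randn(2, 3, 224, 224))
print(region_feats.shape, boxes.shape)  # (2, 10, 256) and (2, 10, 4)
```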
The paper also reviews related work on open-vocabulary object detection, multimodal large language models, and dense captioning. Experiments show that GenerateU achieves strong zero-shot detection performance on LVIS and transfers to various downstream datasets without any modification. Training runs on 16 A100 GPUs, and a Swin-Large backbone further improves performance. Because the outputs are free-form text rather than fixed class indices, evaluation uses METEOR and text-similarity scores to assess both the quality of the generated names and detection performance. The results demonstrate the effectiveness of the generative formulation, show that GenerateU detects a wide variety of objects, and highlight the importance of end-to-end training for open-ended object detection.
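For the similarity-based evaluation, free-form generated names have to be compared against LVIS's fixed vocabulary; a natural way to do this is nearest-neighbor search in a text-embedding space. The sketch below assumes that setup; `embed_text` is a placeholder for a real text encoder (e.g. CLIP's), not the paper's exact pipeline.

```python
import numpy as np

def map_to_vocabulary(generated_names, vocabulary, embed_text):
    """Map each free-form generated name to its nearest fixed category, so
    standard detection metrics can be computed against LVIS annotations."""
    vocab_vecs = np.stack([embed_text(c) for c in vocabulary])
    vocab_vecs /= np.linalg.norm(vocab_vecs, axis=1, keepdims=True)
    mapped = []
    for name in generated_names:
        v = embed_text(name)
        v = v / np.linalg.norm(v)
        mapped.append(vocabulary[int(np.argmax(vocab_vecs @ v))])
    return mapped

# Toy run with a random stand-in encoder; a real evaluation would embed
# strings with a pretrained text encoder.
rng = np.random.default_rng(1)
embed = lambda s: rng.normal(size=32)
print(map_to_vocabulary(["puppy"], ["dog", "cat", "frisbee"], embed))
```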