15 Mar 2024 | Chuang Lin, Yi Jiang, Lizhen Qu, Zehuan Yuan, Jianfei Cai
This paper introduces GenerateU, a novel approach to generative open-ended object detection. The goal is an open-world object detector that localizes all objects in an image and names them in free-form language. Because detection is formulated as a generative problem, no predefined category list is needed at inference. The model pairs an open-world object detector with a language model: it is first trained on a small set of human-annotated object-language pairs and then scaled up with massive image-text pairs, using a pseudo-labeling method to enrich label diversity.
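The summary does not spell out the pseudo-labeling procedure, but one common recipe is to match noun phrases from each paired caption to detected regions by embedding similarity and keep only confident matches as labels. Below is a minimal sketch under that assumption; `embed_region`, `embed_text`, and the threshold are illustrative placeholders, not the paper's actual implementation.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def pseudo_label(regions, noun_phrases, embed_region, embed_text, thresh=0.3):
    """Assign each detected region the best-matching caption noun phrase,
    keeping only matches above a confidence threshold."""
    phrase_vecs = [embed_text(p) for p in noun_phrases]
    labels = []
    for region in regions:
        r_vec = embed_region(region)
        sims = [cosine(r_vec, p_vec) for p_vec in phrase_vecs]
        best = int(np.argmax(sims))
        if sims[best] >= thresh:  # discard low-confidence region-phrase pairs
            labels.append((region, noun_phrases[best]))
    return labels

# Toy run with random stand-in embeddings (a real system would use a
# vision-language encoder such as CLIP).
rng = np.random.default_rng(0)
embed = lambda _: rng.normal(size=16)
print(pseudo_label(["box_1", "box_2"], ["a dog", "a frisbee"], embed, embed))
```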
GenerateU employs Deformable DETR as a region proposal generator and a language model that translates the resulting visual regions into object names. The two components are trained end-to-end, and a region-word alignment loss is added to help the model distinguish region features. Evaluated on the LVIS dataset, GenerateU achieves results comparable to the open-vocabulary object detection method GLIP, even though it never sees the category names during inference.
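As a rough illustration of this two-stage design, the sketch below wires a DETR-style query decoder (a compact stand-in for Deformable DETR) to produce region features and boxes; in the full model, each region feature would then be projected into a seq2seq language model that decodes an object name. All module sizes here are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class RegionProposer(nn.Module):
    """Compact stand-in for Deformable DETR: learned queries attend to image
    features and yield per-region features plus box predictions."""
    def __init__(self, num_queries=10, dim=256):
        super().__init__()
        self.backbone = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # toy patch embed
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.box_head = nn.Linear(dim, 4)  # (cx, cy, w, h) in [0, 1]

    def forward(self, images):                                    # (B, 3, H, W)
        feats = self.backbone(images).flatten(2).transpose(1, 2)  # (B, HW/256, dim)
        q = self.queries.unsqueeze(0).expand(images.size(0), -1, -1)
        region_feats = self.decoder(q, feats)                     # (B, N, dim)
        boxes = self.box_head(region_feats).sigmoid()             # (B, N, 4)
        return region_feats, boxes

# In the full model, each region feature would next be fed (through a
# projection) as encoder context to a language model generating the name.
proposer = RegionProposer()
region_feats, boxes = proposer(torch.randn(2, 3, 224, 224))
print(region_feats.shape, boxes.shape)  # (2, 10, 256) and (2, 10, 4)
```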
The paper also reviews related work on open-vocabulary object detection, multimodal large language models, and dense captioning. Experiments show that GenerateU achieves strong zero-shot detection performance on LVIS and transfers to various downstream datasets without any modification. Training runs on 16 A100 GPUs, and a Swin-Large backbone further improves performance. Because the outputs are free-form text rather than fixed class indices, evaluation uses METEOR and text-similarity scores to assess both the quality of the generated names and detection performance. The results demonstrate the effectiveness of the generative formulation, show that GenerateU detects a wide variety of objects, and highlight the importance of end-to-end training for open-ended object detection.
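For the similarity-based evaluation, free-form generated names have to be compared against LVIS's fixed vocabulary; a natural way to do this is nearest-neighbor search in a text-embedding space. The sketch below assumes that setup; `embed_text` is a placeholder for a real text encoder (e.g. CLIP's), not the paper's exact pipeline.

```python
import numpy as np

def map_to_vocabulary(generated_names, vocabulary, embed_text):
    """Map each free-form generated name to its nearest fixed category, so
    standard detection metrics can be computed against LVIS annotations."""
    vocab_vecs = np.stack([embed_text(c) for c in vocabulary])
    vocab_vecs /= np.linalg.norm(vocab_vecs, axis=1, keepdims=True)
    mapped = []
    for name in generated_names:
        v = embed_text(name)
        v = v / np.linalg.norm(v)
        mapped.append(vocabulary[int(np.argmax(vocab_vecs @ v))])
    return mapped

# Toy run with a random stand-in encoder; a real evaluation would embed
# strings with a pretrained text encoder.
rng = np.random.default_rng(1)
embed = lambda s: rng.normal(size=32)
print(map_to_vocabulary(["puppy"], ["dog", "cat", "frisbee"], embed))
```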