2024-03-12 | Hanning Chen, Wenjun Huang, Yang Ni, Sanggeon Yun, Fei Wen, Hugo Latapie, Mohsen Imani
TaskCLIP is a two-stage framework for task-oriented object detection built on large Vision-Language Models (VLMs). The first stage performs general object detection to propose candidate objects in a scene; the second stage uses VLM image and text embeddings to select the candidates that best satisfy the task's requirements.

Because object image embeddings are often misaligned with the embeddings of their task-relevant visual attributes, TaskCLIP introduces a transformer-based aligner that recalibrates the shared vision-text embedding space so that object embeddings match task-related attribute embeddings. It also adds a select-by-grouping mechanism that reuses the classification output of the object detection network to reduce the high false-negative rate caused by imbalanced training data.

TaskCLIP achieves state-of-the-art results on the COCO-Tasks dataset, outperforming the DETR-based model TOIST by 3.5% in mAP@0.5, and it trains on a single NVIDIA RTX 4090 GPU, demonstrating high training efficiency. Compared with all-in-one models, the two-stage design is argued to be more natural, more generalizable, and more efficient.
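To make the two-stage recipe concrete, here is a minimal sketch of how a frozen CLIP-style backbone plus a small transformer aligner could score detected objects against task-attribute text. All names (TaskAligner, score_objects, the embedding dimension, the threshold) are illustrative assumptions, not the authors' released code.

```python
# Illustrative sketch (not the authors' code): score detector proposals against
# task-attribute text using frozen VLM embeddings plus a small transformer aligner.
import torch
import torch.nn as nn


class TaskAligner(nn.Module):
    """Hypothetical transformer-based aligner that recalibrates object image
    embeddings toward the task-attribute text embedding space."""

    def __init__(self, dim: int = 512, heads: int = 8, layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, obj_emb: torch.Tensor) -> torch.Tensor:
        # obj_emb: (num_objects, dim) image embeddings of detected object crops.
        return self.encoder(obj_emb.unsqueeze(0)).squeeze(0)


def score_objects(obj_emb: torch.Tensor, attr_emb: torch.Tensor,
                  aligner: TaskAligner) -> torch.Tensor:
    """Cosine similarity between aligned object embeddings and task-attribute
    text embeddings; higher scores mean the object better affords the task."""
    aligned = aligner(obj_emb)
    aligned = aligned / aligned.norm(dim=-1, keepdim=True)
    attr = attr_emb / attr_emb.norm(dim=-1, keepdim=True)
    return aligned @ attr.T          # (num_objects, num_attributes)


# Stage 1 (assumed): any off-the-shelf detector supplies boxes, class logits,
# and VLM image embeddings for each cropped proposal.
obj_emb = torch.randn(10, 512)       # 10 detected objects
attr_emb = torch.randn(3, 512)       # e.g. attributes for "sit comfortably on"
scores = score_objects(obj_emb, attr_emb, TaskAligner())
suitable = scores.max(dim=1).values > 0.5   # threshold is illustrative
```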
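The summary only states that select-by-grouping reuses the detector's classification output to cut false negatives; one plausible reading, sketched below under that assumption, is to aggregate VLM scores per detector class so that a whole class group is accepted or rejected together. The aggregation rule and threshold are illustrative, not the paper's exact mechanism.

```python
# Illustrative sketch (assumed reading, not the paper's exact rule): select-by-grouping.
# Objects sharing the detector's predicted class are scored as a group, so one
# confidently matching instance can rescue under-scored instances of the same class.
from collections import defaultdict

import torch


def select_by_grouping(scores: torch.Tensor, class_ids: list[int],
                       threshold: float = 0.5) -> list[bool]:
    # scores: per-object task-suitability scores from the VLM matching stage.
    groups = defaultdict(list)
    for idx, cls in enumerate(class_ids):
        groups[cls].append(idx)

    selected = [False] * len(class_ids)
    for members in groups.values():
        # Accept or reject the whole class group based on its best-scoring member.
        if scores[members].max().item() > threshold:
            for idx in members:
                selected[idx] = True
    return selected


class_ids = [0, 0, 2, 5, 5, 5]                  # detector class predictions
scores = torch.tensor([0.8, 0.3, 0.1, 0.6, 0.2, 0.4])
print(select_by_grouping(scores, class_ids))     # [True, True, False, True, True, True]
```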