T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy

21 Mar 2024 | Qing Jiang, Feng Li, Zhaoyang Zeng, Tianhe Ren, Shilong Liu, Lei Zhang
T-Rex2 is a model for open-set object detection that integrates text and visual prompts to achieve strong zero-shot performance. It uses a DETR-based architecture with two parallel encoders, one for text prompts and one for visual prompts. Text prompts are encoded with CLIP, while visual prompts are encoded through a deformable attention mechanism. A contrastive learning module aligns the text and visual prompt representations so that each modality can enhance the other.
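The contrastive alignment step can be illustrated with a small, self-contained sketch. This is not the authors' implementation: it assumes a generic InfoNCE-style loss in which each visual prompt embedding is pulled toward the text embedding of its own category and pushed away from the others. The function names, the `labels` argument, and the temperature value of 0.07 are all hypothetical choices made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def contrastive_alignment_loss(visual_embs, text_embs, labels, temperature=0.07):
    """Mean InfoNCE-style loss (an assumed form, for illustration only).

    visual_embs: list of visual prompt embeddings
    text_embs:   list of text prompt embeddings, one per category
    labels:      labels[i] is the index in text_embs that matches visual_embs[i]
    """
    texts = [l2_normalize(t) for t in text_embs]
    total = 0.0
    for v, label in zip(visual_embs, labels):
        v = l2_normalize(v)
        # Cosine similarity to every text prompt, sharpened by the temperature.
        logits = [dot(v, t) / temperature for t in texts]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # Cross-entropy with the matching text prompt as the positive.
        total += log_denom - logits[label]
    return total / len(labels)


if __name__ == "__main__":
    visual = [[1.0, 0.0], [0.0, 1.0]]
    text = [[1.0, 0.0], [0.0, 1.0]]
    print(contrastive_alignment_loss(visual, text, labels=[0, 1]))  # well aligned
    print(contrastive_alignment_loss(visual, text, labels=[1, 0]))  # misaligned
```

In this toy setup the loss is close to zero when each visual embedding points in the same direction as its paired text embedding, and large when the pairing is swapped, which is the behaviour an alignment objective of this kind is meant to reward.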
T-Rex2 supports four workflows: interactive visual prompts, generic visual prompts, text prompts, and mixed prompts. It performs strongly on the COCO, LVIS, ODinW, and Roboflow100 benchmarks. Text prompts work best for common objects, while visual prompts work better for rare or complex objects, including the long-tailed categories where text descriptions are weakest. Because the model can switch between prompt modalities, it can handle diverse scenarios, and experiments show that combining text and visual prompts improves detection accuracy. T-Rex2 also performs well in interactive object detection and few-shot object counting.
Ablation studies confirm the effectiveness of contrastive alignment and the benefit of using multiple visual examples. The model is efficient, with fast inference, and shows promise for generic object detection. Its limitations are that text and visual prompts can interfere with each other, and that reliable detection with visual prompts may require multiple visual examples. Overall, T-Rex2 offers a flexible and effective approach to open-set object detection.
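To make the prompt workflows concrete, the sketch below shows one way a single prompt embedding could be assembled for text-only, visual-only, and mixed prompting. It rests on assumptions that the summary above does not state: that several visual examples are combined by averaging their embeddings, and that a mixed prompt is the average of the text embedding and the pooled visual embedding. The names `mean_embedding`, `build_prompt`, and `score_queries` are hypothetical and are not part of any released T-Rex2 API.

```python
def mean_embedding(embs):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(embs)
    return [sum(column) / n for column in zip(*embs)]


def build_prompt(text_emb=None, visual_embs=None):
    """Return one prompt embedding for the chosen workflow (assumed scheme).

    text only   -> the text embedding
    visual only -> mean of the visual example embeddings
    mixed       -> mean of the text embedding and the pooled visual embedding
    """
    if text_emb is None and not visual_embs:
        raise ValueError("provide a text embedding, visual embeddings, or both")
    if not visual_embs:
        return list(text_emb)
    visual = mean_embedding(visual_embs)
    if text_emb is None:
        return visual
    return mean_embedding([list(text_emb), visual])


def score_queries(prompt, query_embs):
    """Dot-product score of each candidate object embedding against the prompt."""
    return [sum(p * q for p, q in zip(prompt, query)) for query in query_embs]


if __name__ == "__main__":
    examples = [[1.0, 0.0], [0.0, 1.0]]  # two visual examples of one category
    generic = build_prompt(visual_embs=examples)
    mixed = build_prompt(text_emb=[1.0, 0.0], visual_embs=examples)
    print(generic, mixed)
    print(score_queries(mixed, [[1.0, 0.0], [0.0, 1.0]]))
```

Pooling several examples into one embedding is a simple way to see why additional visual examples can help: a single example carries the quirks of one instance, while the average moves toward what the examples share.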