21 Mar 2024 | Qing Jiang, Feng Li, Zhaoyang Zeng, Tianhe Ren, Shilong Liu, Lei Zhang
**T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy**
**Authors:** Qing Jiang, Feng Li, Zhaoyang Zeng, Tianhe Ren, Shilong Liu, Lei Zhang
**Institutions:** South China University of Technology, International Digital Economy Academy (IDEA), The Hong Kong University of Science and Technology, Tsinghua University
**Abstract:**
T-Rex2 is a highly practical model for open-set object detection that integrates text and visual prompts within a single framework. It leverages contrastive learning to align text and visual prompts, enhancing their complementary strengths. T-Rex2 supports four workflows: text prompt, interactive visual prompt, generic visual prompt, and mixed prompt. Comprehensive experiments demonstrate strong zero-shot object detection capabilities across various scenarios. Text and visual prompts benefit each other, covering different types of objects effectively. The model's API is available at <https://github.com/IDEA-Research/T-Rex>.
**Related Work:**
- **Text-Prompted Object Detection:** Models that use text prompts for open-vocabulary detection, typically building on pretrained language or vision-language models such as BERT or CLIP.
- **Visual-Prompted Object Detection:** Models that depict novel objects through concrete visual examples, but are constrained by data scarcity and limited descriptive capacity.
- **Interactive Object Detection:** Models that align detection with human intent by letting users specify target objects through visual prompts.
**Method:**
T-Rex2 combines an image encoder, a visual prompt encoder, a text prompt encoder, and a box decoder. Visual prompts are encoded with deformable cross-attention over image features, while text prompts are encoded with CLIP. A contrastive learning module aligns the text and visual prompt embeddings in a shared space so that the two modalities reinforce each other. T-Rex2 supports four inference workflows: text prompt, interactive visual prompt, generic visual prompt, and mixed prompt.
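The following is a minimal PyTorch sketch, not the released implementation, of the two components this paragraph highlights: a visual prompt encoder that turns box prompts into a class embedding by attending to image features, and an InfoNCE-style contrastive loss that aligns text and visual prompt embeddings. Plain multi-head attention stands in for the paper's deformable cross-attention, and all module names, dimensions, and the mean-pooling step are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualPromptEncoder(nn.Module):
    """Aggregates user-provided box prompts into a visual class embedding by
    attending to image features. Standard multi-head attention is used here as
    a stand-in for the deformable cross-attention described in the paper."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.box_embed = nn.Linear(4, dim)  # embed normalized (cx, cy, w, h) boxes
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, boxes: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # boxes: (B, K, 4) box prompts; image_feats: (B, N, dim) flattened features
        queries = self.box_embed(boxes)                        # (B, K, dim)
        attended, _ = self.cross_attn(queries, image_feats, image_feats)
        prompt_tokens = self.norm(queries + attended)          # (B, K, dim)
        # Pool the K box prompts into a single per-category visual embedding.
        return prompt_tokens.mean(dim=1)                       # (B, dim)

def contrastive_alignment_loss(text_embeds: torch.Tensor,
                               visual_embeds: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss: matching text/visual embeddings of the same
    category are pulled together, non-matching pairs are pushed apart."""
    t = F.normalize(text_embeds, dim=-1)
    v = F.normalize(visual_embeds, dim=-1)
    logits = t @ v.t() / temperature                           # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy usage with random tensors (shapes only, no real data):
image_feats = torch.randn(2, 900, 256)        # flattened image-encoder features
boxes = torch.rand(2, 3, 4)                   # 3 normalized box prompts per image
visual_embeds = VisualPromptEncoder()(boxes, image_feats)      # (2, 256)
text_embeds = torch.randn(2, 256)             # stand-in for CLIP text embeddings
loss = contrastive_alignment_loss(text_embeds, visual_embeds)
```

The symmetric cross-entropy over both directions of the similarity matrix is a common choice for aligning two embedding spaces; the paper's exact loss formulation and pooling strategy may differ.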
**Experiments:**
- **COCO, LVIS, ODinW, Roboflow100:** T-Rex2 demonstrates strong zero-shot object detection capabilities across these benchmarks.
- **Text Prompt vs. Visual Prompt:** Text prompts excel in common categories, while visual prompts are better for rare categories.
- **Interactive Object Detection:** T-Rex2 performs well in dense and small object scenarios.
- **Ablation Experiments:** Show the effectiveness of joint training, contrastive alignment, generic visual prompts, mixed prompts, and data engines.
**Conclusion:**
T-Rex2 is a promising approach to generic object detection, leveraging the complementary strengths of text and visual prompts. It offers strong zero-shot capabilities and is applicable to various scenarios. Future work will focus on improving modality alignment and reducing the number of visual examples required for reliable detection.