T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy


21 Mar 2024 | Qing Jiang, Feng Li, Zhaoyang Zeng, Tianhe Ren, Shilong Liu, Lei Zhang
**Authors:** Qing Jiang, Feng Li, Zhaoyang Zeng, Tianhe Ren, Shilong Liu, Lei Zhang

**Institutions:** South China University of Technology; International Digital Economy Academy (IDEA); The Hong Kong University of Science and Technology; Tsinghua University

**Abstract:** T-Rex2 is a highly practical model for open-set object detection that integrates text and visual prompts within a single framework. It uses contrastive learning to align the two prompt modalities so that their complementary strengths reinforce each other. T-Rex2 supports four inference workflows: text prompt, interactive visual prompt, generic visual prompt, and mixed prompt. Comprehensive experiments demonstrate strong zero-shot object detection across a wide range of scenarios, with text and visual prompts covering different types of objects and benefiting each other. The model's API is available at <https://github.com/IDEA-Research/T-Rex>.

**Related Work:**
- **Text-prompted object detection:** models that use text prompts for open-vocabulary detection, leveraging language models such as CLIP or BERT.
- **Visual-prompted object detection:** models that describe novel objects through concrete visual examples, but are limited by data scarcity and the descriptive limits of examples.
- **Interactive object detection:** models that align with human intent by letting users specify target objects through visual prompts.

**Method:** T-Rex2 integrates an image encoder, a visual prompt encoder, a text prompt encoder, and a box decoder. Visual prompts are encoded with deformable cross-attention, while text prompts are encoded with CLIP. A contrastive learning module aligns text and visual prompt embeddings, improving their mutual understanding (a minimal sketch of such an alignment loss is given below). The model supports four workflows: text prompt, interactive visual prompt, generic visual prompt, and mixed prompt (a hypothetical dispatch sketch follows the results).

**Results:**
- **COCO, LVIS, ODinW, Roboflow100:** T-Rex2 demonstrates strong zero-shot object detection across all four benchmarks.
- **Text prompt vs. visual prompt:** text prompts excel on common categories, while visual prompts are better for rare categories.
- **Interactive object detection:** T-Rex2 performs well in dense and small-object scenarios.
- **Ablations:** joint training, contrastive alignment, generic visual prompts, mixed prompts, and the data engines are each shown to be effective.

**Conclusion:** T-Rex2 is a promising approach to generic object detection that leverages the complementary strengths of text and visual prompts. It offers strong zero-shot capability and applies to a wide range of scenarios. Future work will focus on improving modality alignment and reducing the number of visual examples needed for reliable detection.
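The summary above states that a contrastive module aligns text and visual prompt embeddings. The sketch below illustrates one plausible form such an alignment loss could take: a symmetric InfoNCE objective that matches the text embedding and the visual prompt embedding of the same category. The function name, tensor shapes, and temperature value are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(text_emb: torch.Tensor,
                               visual_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss that pulls together the text and visual prompt
    embeddings of the same category and pushes apart different categories.

    text_emb, visual_emb: (num_categories, dim); row i of each tensor is
    assumed to describe the same category i.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    visual_emb = F.normalize(visual_emb, dim=-1)

    # Cosine-similarity logits between every text/visual pair of categories.
    logits = text_emb @ visual_emb.t() / temperature  # (C, C)
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy over both matching directions.
    loss_t2v = F.cross_entropy(logits, targets)
    loss_v2t = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_t2v + loss_v2t)
```

In this formulation the diagonal of the similarity matrix holds the matched text/visual pairs, which is why the target for row i is simply index i.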
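To make the four workflows concrete, here is a hypothetical dispatch sketch. The `model` object and its `encode_text`, `encode_visual`, `fuse`, and `decode_boxes` methods are placeholders invented for illustration; they are not the actual T-Rex2 API (see <https://github.com/IDEA-Research/T-Rex> for the real interface).

```python
def detect(model, image, *, text=None, visual_boxes=None, generic_embedding=None):
    """Dispatch to one of the four prompting workflows described above."""
    if text is not None and visual_boxes is not None:
        # Mixed prompt: combine text and visual prompt embeddings.
        prompt = model.fuse(model.encode_text(text),
                            model.encode_visual(image, visual_boxes))
    elif text is not None:
        # Text prompt: open-vocabulary detection from category names.
        prompt = model.encode_text(text)
    elif visual_boxes is not None:
        # Interactive visual prompt: boxes drawn on the current image.
        prompt = model.encode_visual(image, visual_boxes)
    else:
        # Generic visual prompt: an embedding aggregated from visual
        # examples collected on other images, reused across queries.
        prompt = generic_embedding
    return model.decode_boxes(image, prompt)
```

The point of the sketch is the control flow: all four workflows reduce to producing a prompt embedding that the same box decoder consumes.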