The paper "Unifying Visual and Vision-Language Tracking via Contrastive Learning" introduces UVLTrack, a unified tracker designed to handle three reference settings (bounding box, natural language, and both) for single object tracking. The main contributions of the work are:
1. **Unified Framework**: UVLTrack combines visual and vision-language tracking into a single framework, enabling it to handle all three reference settings.
2. **Modality-Unified Feature Extractor**: A Transformer-based feature extractor is designed to jointly learn visual and language features, ensuring consistent feature learning across different modalities.
3. **Multi-Modal Contrastive Loss**: A multi-modal contrastive loss is proposed to align visual and language features in a unified semantic space, strengthening cross-modal alignment (see the sketch after this list).
4. **Modality-Adaptive Box Head**: A modality-adaptive box head is introduced that mines scenario features dynamically from the video context, enabling robust target localization under all reference settings.
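To make the contrastive alignment in contribution 3 concrete, the following is a minimal sketch of an InfoNCE-style loss between pooled visual and language features. The function name, the pooled inputs, the temperature value, and the symmetric cross-entropy formulation are illustrative assumptions, not the paper's exact multi-modal contrastive loss.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(visual_feats, language_feats, temperature=0.07):
    """InfoNCE-style loss that pulls matching visual/language pairs together.

    visual_feats:   (B, D) pooled target features from the visual branch
    language_feats: (B, D) pooled sentence features from the language branch
    Row i of each tensor is assumed to describe the same target.
    """
    v = F.normalize(visual_feats, dim=-1)
    l = F.normalize(language_feats, dim=-1)

    # Cosine-similarity logits between every visual/language pair in the batch.
    logits = v @ l.t() / temperature                     # (B, B)
    targets = torch.arange(v.size(0), device=v.device)   # matching pairs lie on the diagonal

    # Symmetric cross-entropy: vision-to-language and language-to-vision.
    loss_v2l = F.cross_entropy(logits, targets)
    loss_l2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2l + loss_l2v)
```

Under this kind of objective, matched visual and language embeddings are drawn together while mismatched pairs in the batch are pushed apart, which is one standard way to place both modalities in a shared semantic space.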
The paper evaluates UVLTrack on visual tracking, vision-language tracking, and visual grounding datasets, demonstrating its effectiveness and robustness across the different reference settings. Experimental results show that UVLTrack outperforms state-of-the-art trackers on multiple benchmarks. The code and models are open-sourced at https://github.com/OpenSpaceAI/UVLTrack.