Unifying Visual and Vision-Language Tracking via Contrastive Learning

20 Jan 2024 | Yinchao Ma, Yuyang Tang, Wenfei Yang, Tianzhu Zhang, Jinpeng Zhang, Mengxue Kang
This paper proposes a unified visual and vision-language tracking framework called UVLTrack, which can handle three types of target reference settings within a single model: bounding box (BBOX), natural language (NL), and NL+BBOX. UVLTrack consists of two main components: a modality-unified feature extractor that learns joint visual and language representations, and a modality-adaptive box head that localizes the target. The feature extractor is built on a Transformer architecture in which visual and language features are encoded separately in the shallow layers and fused in the deep layers, and a multi-modal contrastive loss aligns the two modalities into a unified semantic space. The modality-adaptive box head dynamically mines scenario features from video contexts, conditioned on whichever modal reference is available, and localizes the target in a contrastive manner.

Extensive experiments on seven visual tracking datasets, three vision-language tracking datasets, and three visual grounding datasets show that UVLTrack achieves promising performance compared with modality-specific trackers while remaining efficient, running at high FPS for both visual and vision-language tracking. The results demonstrate that UVLTrack is effective across all three reference settings.
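To make the two core ideas above more concrete, the following is a minimal PyTorch-style sketch (not the authors' code) of a modality-unified extractor with separate shallow encoders and shared deep layers, plus an InfoNCE-style contrastive loss that aligns pooled visual and language features. The class names, layer counts, mean-pooling, and temperature are illustrative assumptions rather than details taken from the paper, and the modality-adaptive box head is omitted.

# Minimal sketch of the described design, assuming a PyTorch backbone.
# Layer counts, pooling, and temperature are illustrative, not the paper's settings.
import torch
import torch.nn as nn
import torch.nn.functional as F


def _layer(dim, heads):
    return nn.TransformerEncoderLayer(dim, heads, batch_first=True)


class ModalityUnifiedExtractor(nn.Module):
    """Shallow layers encode each modality separately; deep layers fuse them."""

    def __init__(self, dim=256, heads=8, shallow_layers=2, deep_layers=2):
        super().__init__()
        self.visual_shallow = nn.TransformerEncoder(_layer(dim, heads), shallow_layers)
        self.text_shallow = nn.TransformerEncoder(_layer(dim, heads), shallow_layers)
        self.fused_deep = nn.TransformerEncoder(_layer(dim, heads), deep_layers)

    def forward(self, visual_tokens, text_tokens):
        v = self.visual_shallow(visual_tokens)           # (B, Nv, D)
        t = self.text_shallow(text_tokens)               # (B, Nt, D)
        fused = self.fused_deep(torch.cat([v, t], dim=1))  # joint attention in deep layers
        return fused[:, : v.shape[1]], fused[:, v.shape[1]:]


def multimodal_contrastive_loss(visual_feat, text_feat, temperature=0.07):
    """InfoNCE-style alignment of pooled visual and language features into a
    shared semantic space: each sample's two modalities form the positive pair."""
    v = F.normalize(visual_feat.mean(dim=1), dim=-1)     # (B, D)
    t = F.normalize(text_feat.mean(dim=1), dim=-1)       # (B, D)
    logits = v @ t.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    B, Nv, Nt, D = 2, 196, 16, 256
    extractor = ModalityUnifiedExtractor(dim=D)
    v_out, t_out = extractor(torch.randn(B, Nv, D), torch.randn(B, Nt, D))
    loss = multimodal_contrastive_loss(v_out, t_out)
    print(v_out.shape, t_out.shape, loss.item())

In this sketch the separate shallow encoders give each modality its own early processing, while the shared deep encoder attends across both token sets, mirroring the separate-then-fused design described in the summary; the contrastive loss stands in for the paper's multi-modal contrastive loss that pulls matching visual and language features together in one semantic space.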