6 Jan 2024 | Liangtao Shi, Bineng Zhong, Qihua Liang, Ning Li, Shengping Zhang, Xianxian Li
The paper introduces EVPTrack, a visual object tracking framework that exploits spatio-temporal and multi-scale information. Rather than deciding when to update the template, EVPTrack propagates spatio-temporal information between consecutive frames using tokens, which avoids the need for complex template update mechanisms. The framework consists of an Image-Prompt Encoder, a Spatio-Temporal Encoder, and a Prompt Generator. The Image-Prompt Encoder fuses explicit visual prompts with image features, while the Spatio-Temporal Encoder propagates spatio-temporal information across frames. The Prompt Generator extracts information from the templates and spatio-temporal tokens to generate multi-scale and spatio-temporal prompts. Experiments on six benchmarks (LaSOT, LaSOT$_{ext}$, GOT-10k, UAV123, TrackingNet, and TNL2K) show that EVPTrack achieves competitive performance at real-time speed, outperforming existing trackers in various challenging scenarios. Ablation studies and qualitative comparisons further validate the method.
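
The idea of carrying a token from frame to frame, instead of deciding when to refresh a template, can be pictured with a toy sketch. This is not the authors' implementation: the class `ToyTokenTracker`, the blending weight `mix`, the averaging used to build the prompt, and the dot-product scoring are all hypothetical stand-ins chosen for illustration, and plain Python lists stand in for learned features. The only point it illustrates is that the token is updated unconditionally at every frame, so no separate update rule is needed.

```python
from dataclasses import dataclass, field
from typing import List, Sequence

Vector = List[float]


@dataclass
class ToyTokenTracker:
    """Toy illustration of token propagation; NOT the EVPTrack architecture."""

    template: Vector           # fixed feature summary of the initial target
    mix: float = 0.5           # hypothetical blending weight (assumption)
    token: Vector = field(init=False)

    def __post_init__(self) -> None:
        # The token starts as a copy of the template and is carried forward.
        self.token = list(self.template)

    def step(self, candidates: Sequence[Vector]) -> int:
        """Pick the best candidate in this frame and propagate the token."""
        # Stand-in for a prompt built from the template and the token:
        # here simply their element-wise average (assumption).
        prompt = [(t + s) / 2 for t, s in zip(self.template, self.token)]

        # Score each candidate location by its dot product with the prompt.
        scores = [sum(p * f for p, f in zip(prompt, feat)) for feat in candidates]
        best = max(range(len(scores)), key=scores.__getitem__)

        # Unconditional update: every frame feeds the token, so there is
        # no "when to update" decision to make.
        chosen = candidates[best]
        self.token = [
            (1 - self.mix) * s + self.mix * f for s, f in zip(self.token, chosen)
        ]
        return best


if __name__ == "__main__":
    tracker = ToyTokenTracker(template=[1.0, 0.0])
    frames = [
        [[0.9, 0.1], [0.0, 1.0]],
        [[0.1, 0.9], [0.7, 0.3]],
    ]
    for i, frame in enumerate(frames):
        print(f"frame {i}: picked candidate {tracker.step(frame)}")
```

In the actual framework, learned encoders and a prompt generator play the roles that simple averaging and blending play here; the sketch only mirrors the control flow of per-frame propagation.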