2024-01-06 | Liangtao Shi, Bineng Zhong, Qihua Liang, Ning Li, Shengping Zhang, Xianxian Li
This paper proposes EVPTrack, a novel explicit visual prompts framework for visual object tracking that exploits both spatio-temporal and multi-scale information to improve tracking performance. Instead of updating templates, EVPTrack propagates spatio-temporal information between consecutive frames via tokens, which sidesteps the when-to-update problem and avoids the hyper-parameters of hand-crafted updating strategies. These spatio-temporal tokens are used to generate explicit visual prompts that facilitate inference on the current frame; the prompts are fed into a transformer encoder together with the image tokens without additional processing, keeping the model efficient. In addition, multi-scale template features serve as explicit visual prompts, improving robustness to target scale changes. On six benchmarks (LaSOT, LaSOT_ext, GOT-10k, UAV123, TrackingNet, and TNL2K), EVPTrack achieves competitive performance against state-of-the-art trackers at real-time speed, and ablation studies confirm the effectiveness of the explicit visual prompts. Code and models are available at https://github.com/GXNU-ZhongLab/EVPTrack.
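A minimal sketch of the mechanism the abstract describes, assuming a standard ViT-style token pipeline: spatio-temporal tokens carried across frames and multi-scale template tokens are simply concatenated with the search-image tokens and processed by a plain transformer encoder. This is not the authors' implementation; all module and parameter names below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ExplicitPromptTracker(nn.Module):
    """Illustrative sketch of explicit-visual-prompt tracking (hypothetical names)."""

    def __init__(self, dim=256, num_st_tokens=4, depth=4, heads=8):
        super().__init__()
        # Learnable spatio-temporal tokens: propagated between frames in
        # place of an explicit template-update step.
        self.st_tokens = nn.Parameter(torch.zeros(1, num_st_tokens, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, search_tokens, multiscale_template_tokens, st_tokens=None):
        # search_tokens:              (B, N_s, dim) tokens of the current frame
        # multiscale_template_tokens: (B, N_t, dim) template features pooled at
        #                             several scales and flattened into tokens
        # st_tokens:                  (B, K, dim) prompts carried over from the
        #                             previous frame (first frame falls back to
        #                             the learned initialization)
        B = search_tokens.size(0)
        if st_tokens is None:
            st_tokens = self.st_tokens.expand(B, -1, -1)
        # Prompts join the image tokens directly; no extra fusion module.
        x = torch.cat([st_tokens, multiscale_template_tokens, search_tokens],
                      dim=1)
        x = self.encoder(x)
        k = st_tokens.size(1)
        new_st_tokens = x[:, :k]                       # propagate to next frame
        search_out = x[:, -search_tokens.size(1):]     # feed the box head
        return search_out, new_st_tokens

# Usage sketch: per-frame loop that threads the spatio-temporal tokens forward.
if __name__ == "__main__":
    model = ExplicitPromptTracker()
    st = None
    for _ in range(3):  # stand-in for a video stream
        search = torch.randn(1, 256, 256)      # (B, N_s, dim)
        templates = torch.randn(1, 84, 256)    # (B, N_t, dim), multi-scale
        feats, st = model(search, templates, st)
```

The point of the sketch is the data flow: because the prompts enter the encoder as ordinary tokens, no separate update schedule or fusion head is needed, which matches the efficiency claim in the abstract.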