Autoregressive Queries for Adaptive Tracking with Spatio-Temporal Transformers

15 Mar 2024 | Jinxia Xie, Bineng Zhong, Zhiyi Mo, Shengping Zhang, Liangtao Shi, Shuxiang Song, Rongrong Ji
AQATrack is an adaptive tracker that uses spatio-temporal transformers to learn spatio-temporal information without relying on many hand-designed components. It introduces learnable, autoregressive queries that capture instantaneous changes in target appearance in a sliding-window fashion: a novel attention mechanism generates the query for the current frame conditioned on the existing queries, so each new query inherits the accumulated temporal context. A spatio-temporal information fusion module (STM) then combines the static appearance of the initial template with these instantaneous changes to localize the target. A minimal sketch of this query-generation and fusion scheme is given at the end of this summary.

The tracker achieves state-of-the-art performance on six popular tracking benchmarks: LaSOT, LaSOT_ext, TrackingNet, GOT-10k, TNL2K, and UAV123, outperforming other methods in AUC, particularly on long-term benchmarks such as LaSOT. Qualitative results show that the method captures target state changes and motion trends, and it tracks robustly in challenging scenarios such as camera motion and motion blur.

AQATrack is implemented in PyTorch and trained on four datasets: LaSOT, COCO, TrackingNet, and GOT-10k. Two model variants, AQATrack-256 and AQATrack-384, achieve high AUC scores across the benchmarks, and the tracker runs in real time at over 65 fps. Extensive experiments, including comparisons with other state-of-the-art trackers, confirm that the approach is accurate, robust, and effective at mining spatio-temporal information for visual tracking.
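To make the query-generation and fusion scheme concrete, below is a minimal PyTorch sketch of the idea described above. It is not the authors' implementation: the module names, dimensions, window length, and the exact attention wiring (the current query attending to previous queries and to search-region features, followed by a decoder-style fusion with template features) are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class AutoregressiveQuerySketch(nn.Module):
    """Hypothetical sketch of autoregressive queries + spatio-temporal fusion.

    Not the official AQATrack code: a learnable initial query is updated
    frame by frame by attending to previous queries (temporal context) and
    to the current search-region features (spatial context); an STM-like
    block then fuses static template features with the accumulated queries.
    """

    def __init__(self, dim=256, num_heads=8, window=4):
        super().__init__()
        self.window = window  # sliding-window length (assumed)
        self.init_query = nn.Parameter(torch.randn(1, 1, dim))
        # Attention that conditions the new query on the existing queries.
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Attention that grounds the query in the current search features.
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # STM-like fusion of static (template) and dynamic (query) cues.
        self.fuse = nn.TransformerDecoderLayer(dim, num_heads, batch_first=True)

    def forward(self, template_feats, search_feats_seq):
        """template_feats: (B, Nt, C); search_feats_seq: list of (B, Ns, C)."""
        B = template_feats.size(0)
        queries = [self.init_query.expand(B, -1, -1)]
        for search_feats in search_feats_seq:
            past = torch.cat(queries[-self.window:], dim=1)  # (B, <=W, C)
            q = queries[-1]
            # New query attends to existing queries (autoregressive step) ...
            q, _ = self.temporal_attn(q, past, past)
            # ... and to the current frame's search-region features.
            q, _ = self.spatial_attn(q, search_feats, search_feats)
            queries.append(q)
        # Fuse static appearance (template) with instantaneous changes (queries).
        dynamic = torch.cat(queries[1:], dim=1)      # (B, T, C)
        fused = self.fuse(dynamic, template_feats)   # (B, T, C)
        return fused  # a full tracker would feed this to a box-prediction head
```

A full implementation would be expected to add residual connections and normalization around each attention step, as in standard transformer blocks; the sliding window is what keeps memory bounded while still letting queries carry motion trends across frames.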