29 Mar 2021 | Xin Chen¹*, Bin Yan¹*, Jiawen Zhu¹, Dong Wang¹*, Xiaoyun Yang³ and Huchuan Lu¹,²
Transformer Tracking (TransT) is a novel attention-based feature fusion network designed to improve both the accuracy and the efficiency of visual object tracking. The architecture consists of a Siamese-like feature extraction backbone, an attention-based fusion mechanism, and a classification and regression head. The fusion mechanism is built from two key components: an ego-context augment (ECA) module based on self-attention and a cross-feature augment (CFA) module based on cross-attention. These modules effectively integrate the template and search-region features, producing feature maps with richer semantics than traditional correlation-based fusion.

TransT achieves very promising results on six challenging datasets, especially the large-scale LaSOT, TrackingNet, and GOT-10k benchmarks, and runs at approximately 50 fps on a GPU, making it suitable for real-time applications. The results indicate that attention is more effective than correlation at capturing global information and establishing long-distance feature associations, because every search-region position can attend to all template positions rather than matching within a local window. The framework is simple and efficient, outperforming state-of-the-art trackers in accuracy while maintaining real-time speed, and its effectiveness holds across a wide range of tracking scenarios. Overall, the approach combines the strengths of attention mechanisms with the efficiency of traditional tracking pipelines and represents a significant advancement in visual object tracking.
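To make the fusion design concrete, here is a minimal PyTorch sketch of the two attention modules as described above: an ECA block built on multi-head self-attention and a CFA block built on cross-attention. The module names follow the paper, but the dimensions, normalization placement, and feed-forward details are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Ego-context augment (sketch): multi-head self-attention with a
    residual connection, enriching a feature map with its own global
    context. Hyperparameters are assumptions, not the paper's values."""
    def __init__(self, d_model=256, nhead=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, pos):
        # x: (seq_len, batch, d_model); pos: positional encoding, same shape
        q = k = x + pos
        out, _ = self.attn(q, k, value=x)
        return self.norm(x + out)

class CFA(nn.Module):
    """Cross-feature augment (sketch): multi-head cross-attention that
    injects context from the other branch (template <-> search region),
    followed by a feed-forward block."""
    def __init__(self, d_model=256, nhead=8, dim_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, dim_ff), nn.ReLU(),
            nn.Linear(dim_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, x_pos, mem, mem_pos):
        # x attends to mem, the flattened features of the other branch
        out, _ = self.attn(x + x_pos, mem + mem_pos, value=mem)
        x = self.norm1(x + out)
        return self.norm2(x + self.ffn(x))
```

In the full fusion network, ECA and CFA blocks are stacked and applied symmetrically to the template and search-region feature sequences before the classification and regression head; the sketch above shows only a single block of each.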
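For contrast with the attention-based fusion, here is a sketch of the depth-wise cross-correlation used by traditional Siamese trackers (e.g., SiamRPN++), the operation that TransT's attention modules replace. This is a generic illustration, not code from the paper.

```python
import torch
import torch.nn.functional as F

def xcorr(template_feat, search_feat):
    """Depth-wise cross-correlation: slide each channel of the template
    feature map over the matching channel of the search feature map.
    Shapes: template (B, C, h, w), search (B, C, H, W)."""
    B, C, h, w = template_feat.shape
    # Treat every (batch, channel) slice of the template as its own kernel.
    out = F.conv2d(search_feat.reshape(1, B * C, *search_feat.shape[-2:]),
                   template_feat.reshape(B * C, 1, h, w),
                   groups=B * C)
    return out.reshape(B, C, *out.shape[-2:])  # correlation response map

t = torch.randn(2, 256, 8, 8)    # template features
s = torch.randn(2, 256, 32, 32)  # search-region features
print(xcorr(t, s).shape)         # torch.Size([2, 256, 25, 25])
```

Each correlation output is a fixed linear similarity computed within a local template-sized window, which is why it struggles to capture global context; attention instead computes learned, content-dependent weights over all template positions, yielding the long-distance feature associations described above.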