29 Mar 2021 | Xin Chen¹*, Bin Yan¹*, Jiawen Zhu¹, Dong Wang¹*, Xiaoyun Yang³ and Huchuan Lu¹,²
The paper introduces TransT, a novel Transformer-based tracking method that aims to improve both the accuracy and the efficiency of visual object tracking. The core innovation is the use of attention mechanisms to fuse template and search-region features, replacing the traditional correlation operation. Because correlation is a local linear matching process, it tends to lose semantic information and fall into local optima; TransT instead combines an ego-context augment (ECA) module based on self-attention with a cross-feature augment (CFA) module based on cross-attention, which aggregate global context within each branch and establish long-distance feature associations between them. The overall framework consists of a backbone network for feature extraction, an attention-based feature-fusion network built from ECA and CFA modules, and a prediction head for classification and bounding-box regression. Experiments on several challenging benchmarks, including LaSOT, TrackingNet, and GOT-10k, show that TransT outperforms state-of-the-art trackers in success and precision metrics while running at approximately 50 frames per second (fps). The code and models are available online.
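To make the fusion mechanism concrete, below is a minimal PyTorch sketch of the two modules, assuming the paper's reported settings (256-dim features, 8 attention heads, 4 stacked fusion layers). The module structure follows the standard Transformer attention blocks the paper builds on; class names, tensor shapes, and positional encodings here are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Ego-Context Augment (sketch): multi-head self-attention with a
    residual connection, enriching a branch with its own global context."""
    def __init__(self, d_model=256, nhead=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, pos):
        # x, pos: (seq_len, batch, d_model); pos is a spatial positional encoding
        q = k = x + pos
        out, _ = self.attn(q, k, value=x)
        return self.norm(x + out)

class CFA(nn.Module):
    """Cross-Feature Augment (sketch): multi-head cross-attention plus a
    feed-forward network, fusing context from the other branch into this one."""
    def __init__(self, d_model=256, nhead=8, dim_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, dim_ff), nn.ReLU(),
            nn.Linear(dim_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, mem, pos_x, pos_mem):
        # x attends to the other branch's features (mem)
        out, _ = self.attn(query=x + pos_x, key=mem + pos_mem, value=mem)
        x = self.norm1(x + out)
        return self.norm2(x + self.ffn(x))

# Illustrative usage: template and search-region backbone features,
# flattened to (seq_len, batch, d_model); the token counts below are
# hypothetical (e.g. an 8x8 template map and a 16x16 search map).
t, pos_t = torch.rand(64, 1, 256), torch.rand(64, 1, 256)
s, pos_s = torch.rand(256, 1, 256), torch.rand(256, 1, 256)

eca_t, eca_s = ECA(), ECA()
cfa_t, cfa_s = CFA(), CFA()

# One fusion layer (the paper stacks N = 4 of these): each branch first
# enhances itself with self-attention, then attends to the other branch.
t1, s1 = eca_t(t, pos_t), eca_s(s, pos_s)
t2 = cfa_t(t1, s1, pos_t, pos_s)  # template attends to search region
s2 = cfa_s(s1, t1, pos_s, pos_t)  # search region attends to template
print(s2.shape)  # torch.Size([256, 1, 256]) -> fed to the prediction head
```

In this sketch, the fused search-region features (`s2`) would go to the classification and regression head, replacing the response map a correlation operation would otherwise produce.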