1 Dec 2021 | Luca Bertinetto, Jack Valmadre, João F. Henriques, Andrea Vedaldi, Philip H. S. Torr
This paper introduces a fully-convolutional Siamese network for object tracking, trained end-to-end on the ILSVRC15 dataset for object detection in video. The proposed method achieves state-of-the-art performance in multiple benchmarks while operating at frame-rates beyond real-time. The key idea is to train a Siamese network to locate an exemplar image within a larger search image, and then evaluate this function online during tracking. The network is fully-convolutional with respect to the search image, enabling dense and efficient sliding-window evaluation through a bilinear layer that computes the cross-correlation of its two inputs.
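To make the architecture concrete, here is a minimal PyTorch sketch of the idea (not the authors' code): a shared embedding is applied to both crops, and the cross-correlation layer, implemented as a grouped convolution with the template features as kernel, produces the dense score map in one pass. The class name `SiamFC` and the two-layer `embed` network are illustrative stand-ins; the paper uses a deeper AlexNet-style backbone without padding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiamFC(nn.Module):
    """Minimal sketch of a fully-convolutional Siamese tracker.

    `embed` is an illustrative stand-in for the shared embedding
    function phi; the paper uses an AlexNet-style backbone with no
    padding, which keeps the network fully convolutional.
    """

    def __init__(self):
        super().__init__()
        self.embed = nn.Sequential(  # toy phi, shared by both branches
            nn.Conv2d(3, 96, kernel_size=11, stride=2), nn.ReLU(),
            nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.ReLU(),
        )

    def forward(self, exemplar, search):
        z = self.embed(exemplar)  # (B, C, Hz, Wz) template features
        x = self.embed(search)    # (B, C, Hx, Wx) search features
        # Cross-correlation: slide the template features over the
        # search features. Batched as a grouped convolution with one
        # group per pair, using z as the convolution kernel.
        b, c, hx, wx = x.shape
        score = F.conv2d(x.view(1, b * c, hx, wx), z, groups=b)
        return score.view(b, 1, score.size(-2), score.size(-1))

# Crop sizes follow the paper: 127x127 exemplar, 255x255 search.
net = SiamFC()
z = torch.randn(2, 3, 127, 127)
x = torch.randn(2, 3, 255, 255)
print(net(z, x).shape)  # (2, 1, 33, 33) with this toy embedding
```

The paper additionally adds a learned bias term to the score map (f(z, x) = phi(z) ⋆ phi(x) + b·1), omitted here for brevity.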
The network is trained discriminatively on pairs of exemplar and search images extracted from annotated videos. Each position in the output score map is labelled positive or negative according to its distance from the target centre, a logistic loss is applied at every position, and the training objective is the mean of these losses over the map. Because the network is fully convolutional with respect to the search image, a single evaluation computes the similarity for every translated sub-window of a larger search region, which makes this dense supervision cheap. Training on a large corpus of annotated video makes the learned similarity generalize across tracking benchmarks.
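A sketch of this objective, continuing the code above: a position u is labelled y[u] = +1 if it lies within radius R of the centre in image pixels (the paper uses R = 16 with total network stride k = 8) and −1 otherwise, and the per-position logistic losses ℓ(y, v) = log(1 + exp(−yv)) are averaged. The helper names and the explicit positive/negative balancing are mine, not the paper's (the paper takes a plain mean over the map).

```python
import torch
import torch.nn.functional as F

def score_map_labels(h, w, radius=16, stride=8):
    """y[u] = +1 if position u lies within `radius` image pixels of
    the centre, else -1. `stride` is the network's total stride (the
    paper's k = 8; the toy embedding above actually has stride 4)."""
    ys = torch.arange(h, dtype=torch.float32).view(-1, 1) - (h - 1) / 2
    xs = torch.arange(w, dtype=torch.float32).view(1, -1) - (w - 1) / 2
    dist = stride * torch.sqrt(ys ** 2 + xs ** 2)
    return (dist <= radius).float() * 2 - 1  # maps {0, 1} to {-1, +1}

def logistic_loss(score, labels):
    """Per-position logistic loss l(y, v) = log(1 + exp(-y v));
    softplus(-y*v) is the numerically stable form. The paper averages
    over the whole map; weighting the two classes equally, as here, is
    a common refinement since negatives vastly outnumber positives."""
    losses = F.softplus(-labels * score)
    pos = (labels > 0).expand_as(losses)
    return 0.5 * (losses[pos].mean() + losses[~pos].mean())

score = net(z, x)  # score map from the SiamFC sketch above
labels = score_map_labels(score.size(-2), score.size(-1))
loss = logistic_loss(score, labels)
loss.backward()
```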
The method is evaluated on several benchmarks: OTB-13, VOT-14, VOT-15, and VOT-16. It outperforms many state-of-the-art trackers in accuracy and robustness while running faster than real-time. The results show that the similarity metric learned by the fully-convolutional Siamese network on ImageNet Video is, on its own, sufficient to achieve strong results, comparable or superior to recent state-of-the-art methods. The paper also underlines how training on a large dataset improves the tracker's performance. Being simple and fast, the method can serve as a building block complementary to more sophisticated online tracking methodologies.