8 Jun 2020 | Goutam Bhat*, Martin Danelljan*, Luc Van Gool, Radu Timofte
This paper presents a novel end-to-end trainable tracking architecture that improves target-background discriminability by fully exploiting both target and background appearance information. The proposed method, called DiMP, addresses a key limitation of existing Siamese-based trackers, which use only target appearance when inferring the model. The architecture is derived from a discriminative learning loss: an iterative optimization procedure is unrolled into the network, allowing it to predict a powerful target model in only a few iterations. Key parameters of the discriminative loss itself are also learned, enabling effective end-to-end training. The proposed tracker sets a new state of the art on six tracking benchmarks, achieving an EAO score of 0.440 on VOT2018 while running at over 40 FPS. The code and models are available at https://github.com/visionml/pytracking.
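As a concrete illustration of such an unrolled optimizer, the sketch below minimizes a simple quadratic discriminative loss with steepest descent, using the exact step length available for quadratic objectives. It is a minimal stand-in, not the paper's implementation: DiMP operates on convolutional feature maps with a robust hinge-like loss and learned loss parameters, whereas this example uses a plain linear least-squares model; the function name `steepest_descent_predictor` and all tensor shapes are illustrative.

```python
import torch

def steepest_descent_predictor(X, y, reg=0.01, num_iter=5, w0=None):
    """Sketch of an iterative model predictor in the spirit of DiMP.

    Minimizes the quadratic discriminative loss
        L(w) = ||X w - y||^2 + reg * ||w||^2
    by steepest descent with the exact step length for a quadratic
    objective. X: (num_samples, feat_dim) training features;
    y: (num_samples,) regression labels, e.g. a Gaussian centered
    on the target.
    """
    w = torch.zeros(X.shape[1]) if w0 is None else w0.clone()
    for _ in range(num_iter):
        residual = X @ w - y
        grad = 2 * (X.t() @ residual + reg * w)            # gradient of L at w
        Hg = 2 * (X.t() @ (X @ grad) + reg * grad)         # Hessian-vector product
        alpha = (grad @ grad) / (grad @ Hg).clamp(min=1e-8)  # exact step length
        w = w - alpha * grad                               # steepest-descent update
    return w

# toy usage: random features, Gaussian-like labels
X = torch.randn(64, 16)
y = torch.exp(-((torch.arange(64.0) - 32) ** 2) / 50.0)
w = steepest_descent_predictor(X, y)
print(w.shape)  # torch.Size([16])
```

Because each update uses a closed-form step length rather than a tuned learning rate, a handful of iterations already gives a strong model, which is what makes unrolling the optimizer into the network practical.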
The key contributions of the paper include: (1) a discriminative model-prediction architecture that fully exploits both target and background appearance information when predicting the target model; (2) an end-to-end training approach in which the discriminative loss itself is learned; (3) a steepest-descent-based optimizer that converges in very few iterations; (4) an effective model initializer that provides a reasonable first estimate of the target model, which the optimizer then refines (see the sketch below); and (5) a flexible architecture that can be adapted to different tracking scenarios.
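The sketch below illustrates the role of the initializer in contribution (4): produce a cheap first filter estimate from target features alone, then hand it to the iterative optimizer for refinement. This is a deliberate simplification; in the paper the initializer is a learned convolutional layer followed by precise ROI pooling of the target region, whereas here we merely pool target feature vectors. The name `init_filter` and the shapes are hypothetical.

```python
import torch

def init_filter(target_feats):
    # Hypothetical initializer: pool features sampled inside the target
    # box into a single normalized filter vector. The paper instead uses
    # a learned conv layer with precise ROI pooling; this is a stand-in.
    w0 = target_feats.mean(dim=0)
    return w0 / (w0.norm() + 1e-8)

# usage: start from the pooled-target estimate, then let the iterative
# optimizer from the earlier sketch refine it in a few steps, e.g.
#   w = steepest_descent_predictor(X, y, w0=init_filter(target_feats))
target_feats = torch.randn(32, 16)  # features from within the target box
w0 = init_filter(target_feats)
```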
The method is evaluated on seven tracking benchmarks: VOT2018, LaSOT, TrackingNet, GOT10k, NFS, OTB-100, and UAV123. The results show that it outperforms existing state-of-the-art approaches in accuracy, robustness, and generalization, achieving leading AUC scores on the AUC-based benchmarks and the top EAO score on VOT2018, while running at over 40 FPS on a single GPU. The paper also provides an extensive experimental analysis of the architecture, quantifying the impact of each component. The results demonstrate that the method is effective across a wide range of scenarios, including long sequences, fast-moving objects, and low-altitude aerial videos.
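For reference, the AUC numbers cited above follow the standard success-plot protocol used by OTB/LaSOT-style benchmarks: for each overlap threshold, count the fraction of frames whose predicted box overlaps the ground truth by at least that much, then average across thresholds. The sketch below uses the common 21-threshold convention; individual benchmark toolkits differ in details such as threshold spacing and the handling of frames where the target is absent.

```python
import numpy as np

def success_auc(ious, thresholds=np.linspace(0, 1, 21)):
    """Success-plot AUC for one sequence: average, over overlap
    thresholds, of the fraction of frames with IoU at or above
    the threshold. `ious` is a 1-D array of per-frame IoU values."""
    ious = np.asarray(ious)
    success = [(ious >= t).mean() for t in thresholds]
    return float(np.mean(success))

# toy usage with made-up per-frame overlaps
print(success_auc([0.9, 0.8, 0.0, 0.75, 0.6]))
```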