BootsTAP: Bootstrapped Training for Tracking-Any-Point


23 May 2024 | Carl Doersch¹, Pauline Luc¹, Yi Yang¹, Dilara Gokay¹, Skanda Koppula¹, Ankush Gupta¹, Joseph Heyward¹, Ignacio Rocco¹, Ross Goroshin¹, João Carreira¹, and Andrew Zisserman¹,²
Bootstrapped Training for Tracking-Any-Point (BootsTAP) improves TAP performance using large-scale, unlabeled real-world data with minimal architectural changes. It achieves state-of-the-art results on the TAP-Vid benchmark, surpassing previous results by a wide margin: for example, TAP-Vid-DAVIS performance improves from 61.3% to 67.4%, and TAP-Vid-Kinetics from 57.2% to 62.5%.

The approach is a self-supervised student-teacher setup in which a "teacher" model, pre-trained on synthetic data, provides pseudo-ground-truth labels for a "student" model trained on real-world video; the student is trained to reproduce the teacher's predictions after spatial transformations and corruptions are applied to the video. Self-consistency serves as the supervisory signal, exploiting the fact that correct tracks should transform consistently with spatial transformations of the video and be invariant to the choice of query point along a trajectory. The overall pipeline is outlined in Figure 1. Data augmentations such as affine transformations, JPEG corruption, and frame resizing increase task difficulty for the student, and the teacher's weights are updated as an exponential moving average of the student's weights.
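To make the training recipe concrete, here is a minimal PyTorch sketch of one bootstrapping step; it is an illustration under several assumptions rather than the paper's implementation. The toy `PointTracker` stands in for the real TAPIR-style architecture, tracks and queries are assumed to use normalized [-1, 1] coordinates, and the Huber position loss, soft visibility target, EMA decay of 0.99, and transform ranges are placeholder choices. The query-point resampling consistency and the JPEG/resize corruptions mentioned above are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PointTracker(nn.Module):
    """Toy stand-in for a TAP model (e.g. TAPIR); not the real architecture."""

    def __init__(self, dim=64):
        super().__init__()
        self.query_mlp = nn.Linear(3, dim)   # embeds (t, x, y) query points
        self.frame_mlp = nn.Linear(3, dim)   # embeds mean-pooled RGB per frame
        self.head = nn.Linear(dim, 3)        # per frame/point: (x, y, visibility logit)

    def forward(self, video, queries):
        # video: [B, T, H, W, 3] in [0, 1]; queries: [B, N, 3] as (t, x, y), x/y in [-1, 1]
        frame_feat = self.frame_mlp(video.mean(dim=(2, 3)))            # [B, T, dim]
        query_feat = self.query_mlp(queries)                           # [B, N, dim]
        joint = torch.tanh(frame_feat[:, :, None, :] + query_feat[:, None, :, :])
        out = self.head(joint)                                         # [B, T, N, 3]
        return out[..., :2], out[..., 2]   # tracks [B, T, N, 2], visibility logits [B, T, N]


def ema_update(teacher, student, decay=0.99):
    """Teacher parameters follow an exponential moving average of the student's."""
    with torch.no_grad():
        for p_t, p_s in zip(teacher.parameters(), student.parameters()):
            p_t.mul_(decay).add_(p_s, alpha=1.0 - decay)


def bootstrap_step(student, teacher, optimizer, video, queries):
    """One self-training step on an unlabeled clip (shapes as in PointTracker.forward)."""
    B, T, H, W, _ = video.shape

    # 1) The teacher (no gradients) predicts pseudo-ground-truth tracks on the clean video.
    with torch.no_grad():
        t_tracks, t_vis = teacher(video, queries)

    # 2) Sample a random zoom + translation in normalized coordinates and warp every
    #    frame with it; JPEG corruption and frame resizing would be applied similarly.
    scale = float(torch.empty(1).uniform_(0.7, 1.3))
    shift = 0.2 * torch.randn(2)
    sx, sy = float(shift[0]), float(shift[1])
    theta = torch.tensor([[1.0 / scale, 0.0, -sx / scale],
                          [0.0, 1.0 / scale, -sy / scale]]).repeat(B * T, 1, 1)
    grid = F.affine_grid(theta, size=(B * T, 3, H, W), align_corners=False)
    frames = video.reshape(B * T, H, W, 3).permute(0, 3, 1, 2)
    warped = F.grid_sample(frames, grid, align_corners=False)
    warped = warped.permute(0, 2, 3, 1).reshape(B, T, H, W, 3)

    # 3) Push the teacher's tracks and the query points through the same transform, so a
    #    correct student must be equivariant to it: p_out = scale * p_in + shift.
    target = scale * t_tracks + shift
    s_queries = torch.cat([queries[..., :1], scale * queries[..., 1:] + shift], dim=-1)

    # 4) Student prediction on the corrupted video. The position loss (Huber) is masked by
    #    the teacher's predicted visibility; visibility matches the teacher's soft labels.
    s_tracks, s_vis = student(warped, s_queries)
    vis_mask = (torch.sigmoid(t_vis) > 0.5).float()
    pos_loss = (F.huber_loss(s_tracks, target, reduction='none').mean(-1) * vis_mask).mean()
    vis_loss = F.binary_cross_entropy_with_logits(s_vis, torch.sigmoid(t_vis))
    loss = pos_loss + vis_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # 5) The teacher slowly tracks the student via an exponential moving average.
    ema_update(teacher, student)
    return float(loss)
```

In a full training loop, the teacher would start as a copy of the synthetically pre-trained model (e.g. `teacher = copy.deepcopy(student)`) with gradient tracking disabled, and `bootstrap_step` would be called over batches of unlabeled real-world clips.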
On TAP-Vid, the method shows significant improvements in occlusion accuracy, position accuracy, and average Jaccard index. It is also tested on real-world robotic manipulation videos, where it improves tracking of textureless objects and handling of large changes in scale. The model and a checkpoint are released as open source on GitHub. Overall, the approach demonstrates that unlabeled real-world data can be used to improve TAP, achieving new state-of-the-art results with minimal architectural changes.