DINO-Tracker: Taming DINO for Self-Supervised Point Tracking in a Single Video

11 Jul 2024 | Narek Tumanyan*, Assaf Singer*, Shai Bagon, Tali Dekel
DINO-Tracker is a framework for long-term dense point tracking in video that combines test-time training on a single video with features from a pre-trained DINO-ViT model. It uses the semantic information captured by DINO as a prior and refines those features to improve tracking performance. Key contributions include:

1. **Combining Test-Time Training and External Priors**: DINO-Tracker combines test-time training on a single video with the localized semantic features learned by a pre-trained DINO-ViT model.
2. **Refined Features**: The framework refines DINO's features to better fit the motion observations of the test video, improving the robustness and accuracy of tracking.
3. **Self-Supervised Losses**: The method uses a combination of self-supervised losses and regularization to retain and benefit from DINO's semantic prior.
4. **State-of-the-Art Performance**: Extensive evaluation shows that DINO-Tracker achieves state-of-the-art results on known benchmarks, outperforming self-supervised methods and state-of-the-art supervised trackers, especially in challenging cases of long-term occlusion.

The paper also includes a detailed method section that explains the Delta-DINO model and the optimization objective, followed by evaluation on several benchmarks. The results show that DINO-Tracker outperforms existing methods in both position accuracy and occlusion accuracy, particularly in scenarios with long-term occlusions.
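
The feature-refinement idea summarized above can be illustrated with a toy sketch. It is not the authors' implementation. It assumes that refinement adds a learned residual to a frozen base feature (a reading suggested by the name "Delta-DINO"), and it swaps in plain cosine-similarity nearest-neighbor matching as a stand-in for the actual tracker. The names `refine`, `track_point`, and `cosine_similarity`, and all feature values, are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)  # epsilon avoids division by zero

def refine(base_feature, residual):
    """Assumed refinement: frozen base feature plus a learned residual (delta)."""
    return [b + r for b, r in zip(base_feature, residual)]

def track_point(query_feature, target_features):
    """Return the (x, y) position whose refined feature best matches the query.

    target_features maps (x, y) positions in the target frame to feature vectors.
    Nearest-neighbor matching is an illustrative stand-in, not the paper's tracker.
    """
    return max(
        target_features,
        key=lambda pos: cosine_similarity(query_feature, target_features[pos]),
    )

# Toy 2-D features; real DINO-ViT features are high-dimensional.
query = refine([1.0, 0.0], [0.0, 0.2])
target = {
    (0, 0): refine([0.0, 1.0], [0.0, 0.0]),
    (1, 0): refine([0.9, 0.1], [0.0, 0.1]),
    (0, 1): refine([-1.0, 0.0], [0.0, 0.0]),
}
best = track_point(query, target)
print(best)
```

In this sketch the residual is given by hand. In a test-time training setting it would instead be produced by a small model optimized on the single input video, with regularization that keeps the refined features close to the pre-trained prior.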