DINO-Tracker is a framework for long-term dense point tracking in video that combines test-time training on a single video with the power of pre-trained DINO-ViT features. A pre-trained DINO-ViT model extracts semantic features, which are then refined for tracking within that one video. Self-supervised losses and regularization allow the refined features to retain, and benefit from, DINO's semantic prior. The method achieves state-of-the-art results on standard benchmarks, outperforming self-supervised methods and remaining competitive with state-of-the-art supervised trackers, especially in challenging cases of long-term occlusion.
The framework is trained end-to-end using a combination of self-supervised losses, including flow loss, DINO best-buddies loss, refined best-buddies loss, cycle-consistency loss, and prior preservation loss. These losses help refine DINO features to act as "trajectory embeddings," enabling accurate tracking across long-term occlusions and challenging object deformations. The method also includes an occlusion prediction module that determines the visibility of a query point based on trajectory agreement.
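Two of the ideas named above, best buddies and cycle consistency, can be illustrated with a minimal sketch. It assumes that "best buddies" means mutual nearest neighbours under cosine similarity and that cycle consistency is measured as the round-trip distance of a point tracked forward and then backward. The function names (`best_buddies`, `cycle_error`) and the plain-list feature format are hypothetical, chosen for illustration, and are not taken from the DINO-Tracker implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def nearest(query, candidates):
    """Index of the candidate feature most similar to `query`."""
    return max(range(len(candidates)), key=lambda j: cosine(query, candidates[j]))


def best_buddies(feats_a, feats_b):
    """Return (i, j) pairs where feats_a[i] and feats_b[j] are mutual nearest neighbours."""
    a_to_b = [nearest(f, feats_b) for f in feats_a]
    b_to_a = [nearest(f, feats_a) for f in feats_b]
    return [(i, j) for i, j in enumerate(a_to_b) if b_to_a[j] == i]


def cycle_error(point, forward, backward):
    """Distance between `point` and its position after a forward-backward round trip.

    `forward` and `backward` are hypothetical callables mapping an (x, y)
    position in one frame to the tracked position in the other frame.
    """
    x, y = backward(forward(point))
    return math.hypot(x - point[0], y - point[1])
```

In this sketch, a pair survives only if each feature is the other's top match, which filters out one-sided correspondences. A cycle error near zero indicates that the forward and backward tracks agree. How the actual losses are built from such pairs and errors is not specified in the text above.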
DINO-Tracker outperforms existing self-supervised methods in position accuracy, occlusion accuracy, and average Jaccard index, and is competitive with supervised trackers. It is particularly effective in scenarios that require semantic understanding or involve long-term occlusions. Its lightweight architecture allows fast refinement of the pre-trained features. In ambiguous regions, it produces more accurate and semantically consistent trajectories.
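The three metrics mentioned above can be sketched as follows, assuming the definitions commonly used with the TAP-Vid benchmarks: position accuracy is the fraction of ground-truth-visible points predicted within a pixel threshold, averaged over several thresholds; occlusion accuracy is the fraction of correct visibility predictions; and average Jaccard combines both. The threshold values and all function names are assumptions made for this illustration, since the text does not state them.

```python
import math

# Pixel thresholds; an assumption for this sketch, not stated in the text.
THRESHOLDS = (1, 2, 4, 8, 16)


def _errors(pred, gt):
    """Euclidean distance between each predicted and ground-truth (x, y) point."""
    return [math.hypot(p[0] - g[0], p[1] - g[1]) for p, g in zip(pred, gt)]


def position_accuracy(pred, gt, gt_visible, thresholds=THRESHOLDS):
    """Mean over thresholds of the fraction of visible points within the threshold."""
    errs = [e for e, v in zip(_errors(pred, gt), gt_visible) if v]
    if not errs:
        return 0.0
    fractions = [sum(e < t for e in errs) / len(errs) for t in thresholds]
    return sum(fractions) / len(thresholds)


def occlusion_accuracy(pred_visible, gt_visible):
    """Fraction of points whose predicted visibility matches the ground truth."""
    correct = sum(p == g for p, g in zip(pred_visible, gt_visible))
    return correct / len(gt_visible)


def average_jaccard(pred, gt, pred_visible, gt_visible, thresholds=THRESHOLDS):
    """Mean over thresholds of TP / (TP + FP + FN).

    A true positive is a point that is visible, predicted visible, and
    predicted within the threshold.
    """
    errs = _errors(pred, gt)
    scores = []
    for t in thresholds:
        hits = [g and p and e < t for e, p, g in zip(errs, pred_visible, gt_visible)]
        tp = sum(hits)
        fp = sum(p and not h for p, h in zip(pred_visible, hits))
        fn = sum(g and not h for g, h in zip(gt_visible, hits))
        total = tp + fp + fn
        scores.append(tp / total if total else 0.0)
    return sum(scores) / len(thresholds)
```

For example, with two visible points whose errors are 0.5 and 3.0 pixels, the first counts at every threshold and the second only from 4 pixels upward, so the position accuracy is (0.5 + 0.5 + 1 + 1 + 1) / 5 = 0.8.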
The framework is evaluated on several benchmarks, including TAP-Vid-DAVIS, TAP-Vid-Kinetics, and BADJA. It achieves state-of-the-art results in both position accuracy and occlusion accuracy, demonstrating its effectiveness in dense tracking. The method is also effective in handling articulated objects and self-occlusion, which are particularly challenging scenarios for tracking. Overall, DINO-Tracker provides a powerful solution for dense point tracking in video, combining test-time training with external priors to achieve state-of-the-art performance.