**Abstract:**
LocoTrack is a novel model designed for tracking any point (TAP) across video sequences, addressing the limitations of existing methods that rely on local 2D correlation maps. LocoTrack employs a local all-pair correspondence formulation using 4D correlation to establish precise and bidirectional correspondences, enhancing robustness against matching ambiguities. The model incorporates a lightweight correlation encoder and a compact Transformer architecture to integrate long-term temporal information. LocoTrack achieves superior accuracy on all TAP-Vid benchmarks and operates at a speed 6 times faster than current state-of-the-art methods.
**Introduction:**
Point correspondence is a fundamental problem in computer vision, crucial for applications like 3D reconstruction, autonomous driving, and pose estimation. Recent methods often use 2D local correlation maps, which struggle with homogeneous regions and repetitive features. LocoTrack overcomes these challenges by leveraging dense correspondence, specifically 4D correlation, to establish robust and smooth correspondences. The model is designed to maintain efficiency by restricting the search space to local neighborhoods and using a Transformer to handle temporal context.
**Method:**
LocoTrack takes a query point and a video as input, aiming to produce a track and occlusion probabilities. The process involves two stages: track initialization and refinement. In the initialization stage, a global correlation map is used to estimate initial track positions. In the refinement stage, local 4D correlation is used to refine the tracks iteratively, leveraging the smoothness of both query and target dimensions. The Transformer is used to model temporal context, handling variable sequence lengths and long-range dependencies efficiently.
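The two stages can be illustrated with a minimal NumPy sketch. It is not the LocoTrack implementation: the function names (`global_init`, `crop_patch`, `local_4d_correlation`), the integer-pixel crops, the zero padding, and the raw dot-product similarity are all assumptions made for illustration, and the learned correlation encoder and the Transformer are left out. The sketch shows only how an argmax over a global correlation map can seed a position, and how correlating every feature in a neighborhood of the query point with every feature in a neighborhood of the estimated target point yields a 4D tensor.

```python
import numpy as np

def global_init(query_feat, target_feats):
    """Initial position: argmax of the global correlation map.
    query_feat: (C,), target_feats: (H, W, C). Returns integer (y, x)."""
    corr = target_feats @ query_feat  # (H, W) dot-product similarity
    y, x = np.unravel_index(np.argmax(corr), corr.shape)
    return int(y), int(x)

def crop_patch(feats, center, radius):
    """Crop a (2r+1, 2r+1, C) patch around an integer center, zero-padded at borders."""
    y, x = center
    padded = np.pad(feats, ((radius, radius), (radius, radius), (0, 0)))
    # After padding, original index y sits at y + radius, so the window starts at y.
    return padded[y:y + 2 * radius + 1, x:x + 2 * radius + 1]

def local_4d_correlation(query_feats, target_feats, query_pt, target_pt, radius=2):
    """All-pair correlation between two local neighborhoods.
    Output[i, j, k, l] = <query patch feature (i, j), target patch feature (k, l)>."""
    q = crop_patch(query_feats, query_pt, radius)
    t = crop_patch(target_feats, target_pt, radius)
    return np.einsum("ijc,klc->ijkl", q, t)

# Toy data (hypothetical): unit-norm random features; the target frame is the
# query frame shifted by (1, 2) pixels.
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 8, 4))
feats /= np.linalg.norm(feats, axis=-1, keepdims=True)
target = np.roll(feats, shift=(1, 2), axis=(0, 1))

query_pt = (2, 3)
init = global_init(feats[query_pt], target)                 # stage 1: global match
corr = local_4d_correlation(feats, target, query_pt, init)  # stage 2 input: 4D volume
print(init, corr.shape)
```

A 2D local correlation map would compare only the single query feature with the target neighborhood. The 4D volume also records how the query point's neighbors match, which is the extra signal the refinement stage uses against repetitive or textureless regions.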
**Experiments:**
LocoTrack is evaluated on the TAP-Vid benchmark and the RoboTAP dataset. It outperforms state-of-the-art methods on position accuracy, occlusion accuracy, and Average Jaccard (AJ). The model is also efficient: a small variant improves AJ by +2.5 over TAPIR and by +5.6 over CoTracker while running inference 6 times faster.
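For readers unfamiliar with the three metrics, the sketch below computes them for one set of tracked points. It assumes the definitions commonly used with TAP-Vid: pixel thresholds of 1, 2, 4, 8, and 16; position accuracy measured on ground-truth-visible points; and a Jaccard score that counts a prediction as a true positive only when the point is visible, predicted visible, and within the threshold. The function name `tap_metrics` and the toy inputs are hypothetical, and the official evaluation code may differ in details such as image rescaling.

```python
import numpy as np

def tap_metrics(gt_xy, gt_visible, pred_xy, pred_visible, thresholds=(1, 2, 4, 8, 16)):
    """gt_xy, pred_xy: (N, 2) float arrays; gt_visible, pred_visible: (N,) bool arrays."""
    dist = np.linalg.norm(pred_xy - gt_xy, axis=-1)
    # Occlusion accuracy: fraction of points whose visibility flag is predicted correctly.
    occlusion_accuracy = float(np.mean(pred_visible == gt_visible))
    pos_acc, jaccard = [], []
    for thr in thresholds:
        within = dist < thr
        # Position accuracy: visible points localized within the threshold.
        pos_acc.append(np.sum(within & gt_visible) / np.sum(gt_visible))
        tp = np.sum(within & gt_visible & pred_visible)
        # False positive: predicted visible, but occluded in GT or too far away.
        fp = np.sum(pred_visible & (~gt_visible | ~within))
        # GT-visible count equals TP + FN, so this is TP / (TP + FN + FP).
        jaccard.append(tp / (np.sum(gt_visible) + fp))
    return {
        "occlusion_accuracy": occlusion_accuracy,
        "average_position_accuracy": float(np.mean(pos_acc)),
        "average_jaccard": float(np.mean(jaccard)),
    }

# Toy example (hypothetical values): four points, one occluded in the ground truth.
gt_xy = np.zeros((4, 2))
gt_visible = np.array([True, True, True, False])
pred_xy = np.array([[0.5, 0.0], [3.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
pred_visible = np.array([True, True, False, True])
m = tap_metrics(gt_xy, gt_visible, pred_xy, pred_visible)
print(m)
```

Because AJ penalizes both localization errors and visibility errors, a gain in AJ reflects better tracks and better occlusion prediction together.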
**Conclusion:**
LocoTrack addresses the shortcomings of existing point tracking methods by leveraging dense correspondence and efficient computation. It achieves superior performance and real-time inference, making it a promising approach for point tracking tasks.