LocoTrack is a highly accurate and efficient model for tracking points across video sequences. Existing methods rely on local 2D correlation maps, which struggle in homogeneous regions or with repetitive features and therefore produce ambiguous matches. LocoTrack instead builds all-pair correspondences across regions, in the form of a local 4D correlation, to establish precise correspondences. Bidirectional correspondence and matching smoothness make this formulation more robust to ambiguities. A lightweight correlation encoder improves computational efficiency, while a compact Transformer architecture integrates long-term temporal information. LocoTrack achieves unmatched accuracy on TAP-Vid benchmarks and operates significantly faster than state-of-the-art methods.
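To make the difference between local 2D and local 4D correlation concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes dense feature maps of shape (H, W, C), integer point coordinates, edge padding at image borders, and cosine similarity as the matching score; the function names `local_patch` and `local_4d_correlation` and the `radius` parameter are hypothetical.

```python
import numpy as np


def local_patch(feat, center, radius):
    """Crop a (2r+1, 2r+1, C) patch around an integer (y, x) center.

    Assumption: borders are handled by edge padding.
    """
    y, x = center
    padded = np.pad(feat, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    # After padding, original pixel (y, x) sits at (y + radius, x + radius),
    # so this slice covers original rows y-r..y+r and columns x-r..x+r.
    return padded[y:y + 2 * radius + 1, x:x + 2 * radius + 1]


def local_4d_correlation(feat_q, feat_t, query_yx, est_yx, radius=3):
    """All-pair correlation between two local neighbourhoods.

    feat_q, feat_t: (H, W, C) feature maps of the query and target frames.
    query_yx: query point in the query frame.
    est_yx: current track estimate in the target frame.
    Returns an array of shape (2r+1, 2r+1, 2r+1, 2r+1).
    """
    pq = local_patch(feat_q, query_yx, radius)
    pt = local_patch(feat_t, est_yx, radius)
    # L2-normalise so each entry is a cosine similarity (an assumption here).
    pq = pq / (np.linalg.norm(pq, axis=-1, keepdims=True) + 1e-8)
    pt = pt / (np.linalg.norm(pt, axis=-1, keepdims=True) + 1e-8)
    # Entry [i, j, k, l] compares query-patch cell (i, j) with target-patch cell (k, l).
    return np.einsum("ijc,klc->ijkl", pq, pt)


rng = np.random.default_rng(0)
feat = rng.normal(size=(16, 16, 8))
corr = local_4d_correlation(feat, feat, (5, 5), (5, 5), radius=2)
# The slice at the query-patch centre is the familiar local 2D correlation map.
corr_2d = corr[2, 2]
```

In this sketch, a local 2D correlation keeps only `corr[r, r]`, the similarity of the single query point to its target neighbourhood. The 4D volume additionally records how every neighbour of the query matches every neighbour of the estimate, which is the extra evidence a method can use to check bidirectional consistency and smoothness.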
The model is designed to handle point correspondence across different views of a scene, with applications in 3D reconstruction, autonomous driving, and pose estimation. It processes input videos and query points to determine the corresponding positions in each frame, along with visibility status. LocoTrack uses a two-stage approach: track initialization and track refinement. The initialization stage uses global correlation maps to determine initial positions, while the refinement stage iteratively improves the track using local 4D correlation and a Transformer for temporal modeling.
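The initialization stage can be sketched as follows. This is an illustrative NumPy example under stated assumptions rather than the model's actual code: it assumes a single query feature vector, a dot-product global correlation map per frame, and a soft-argmax to turn each map into a position. The function name `init_track` and the `temperature` parameter are hypothetical, and the iterative refinement stage is not shown.

```python
import numpy as np


def init_track(query_feat, frame_feats, temperature=20.0):
    """Estimate one initial (y, x) position per frame from global correlation.

    query_feat: (C,) feature vector sampled at the query point.
    frame_feats: (T, H, W, C) feature maps for all frames.
    Returns an array of shape (T, 2).
    """
    T, H, W, _ = frame_feats.shape
    # Global correlation: compare the query feature with every location of every frame.
    corr = np.einsum("thwc,c->thw", frame_feats, query_feat)
    # Soft-argmax (an assumption here): softmax over all locations, then expected coordinate.
    logits = corr.reshape(T, H * W) * temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    prob = np.exp(logits)
    prob = prob / prob.sum(axis=1, keepdims=True)
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    grid = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)  # (H*W, 2)
    return prob @ grid


# Toy input: the query feature appears at exactly one location in each frame.
frames = np.zeros((3, 8, 8, 4))
for t in range(3):
    frames[t, t + 1, 2 * t, 0] = 1.0
query = np.array([1.0, 0.0, 0.0, 0.0])
init_positions = init_track(query, frames)
```

A refinement stage would then take `init_positions` as its starting estimate and repeatedly update it from correlation features computed around the current position.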
LocoTrack outperforms existing methods in both accuracy and efficiency. On the TAP-Vid-DAVIS dataset, it improves AJ (Average Jaccard) by +2.5 over CoTracker while running inference 6× faster, and it surpasses TAPIR by +5.6 AJ with 3.5× faster inference. Its lightweight architecture and efficient processing make it suitable for real-time applications. LocoTrack also handles long-range tracking well, even under occlusion and in other challenging scenarios. Its ability to process long videos efficiently and its robustness to matching ambiguities make it a significant advance in point tracking.