19 Mar 2024 | Hongyang Li, Hao Zhang, Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Lei Zhang
TAPTR is a simple and strong framework for tracking any point (TAP) with transformers. Inspired by DETR-like algorithms, TAPTR models each tracking point as a query that is refined layer by layer. Each query consists of a positional part and a content part, with visibility predicted from the updated content feature. Queries belonging to the same tracking point exchange information through self-attention along the temporal dimension. The framework also incorporates the cost volume used in optical flow models and introduces simple designs that provide long temporal information while mitigating feature drifting.

TAPTR demonstrates strong performance on various TAP datasets with faster inference. It outperforms CoTracker on the DAVIS dataset, achieving 63.0 mAP while running 1.3 times faster, and it still surpasses CoTracker when tracking a single point at a time while remaining 25 times faster. The framework is conceptually simple and effective, and extensive ablation studies confirm the importance of its components. Evaluated on the TAP-Vid benchmark, TAPTR shows significant superiority over previous methods across most metrics, and it is also effective for trajectory prediction and video editing tasks.
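To make the query design concrete, below is a minimal PyTorch sketch of a DETR-style decoder layer for point tracking as the summary describes it: each point in each frame is a query with a positional part and a content part, queries of the same point exchange information via self-attention along the temporal axis, the position is refined layer by layer, and visibility is predicted from the updated content feature. All names (e.g. `PointQueryDecoderLayer`) and shapes are hypothetical illustrations, not the authors' implementation, which also includes cost-volume aggregation omitted here.

```python
# Hypothetical sketch of a TAPTR-style decoder layer; not the authors' code.
import torch
import torch.nn as nn

class PointQueryDecoderLayer(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Temporal self-attention: queries of the same point attend across frames.
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Cross-attention from each frame's queries to that frame's image features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.pos_head = nn.Linear(dim, 2)  # predicts a (dx, dy) position refinement
        self.vis_head = nn.Linear(dim, 1)  # visibility logit from the content feature

    def forward(self, content, pos, frame_feats):
        # content:     (P, T, D) content part of each point query per frame
        # pos:         (P, T, 2) positional part (normalized x, y)
        # frame_feats: (T, N, D) flattened image features per frame
        # 1) Self-attention along the temporal dimension within each trajectory.
        q = self.norm1(content)
        content = content + self.temporal_attn(q, q, q)[0]
        # 2) Cross-attention to the corresponding frame's features (batch = frames).
        q = self.norm2(content).transpose(0, 1)            # (T, P, D)
        attended = self.cross_attn(q, frame_feats, frame_feats)[0]
        content = content + attended.transpose(0, 1)       # back to (P, T, D)
        # 3) Layer-by-layer refinement: update position, predict visibility.
        pos = pos + self.pos_head(content)
        visibility = torch.sigmoid(self.vis_head(content))
        return content, pos, visibility

# Toy usage: 4 tracked points over 16 frames, 100 feature tokens per frame.
layer = PointQueryDecoderLayer()
content = torch.randn(4, 16, 256)
pos = torch.rand(4, 16, 2)
feats = torch.randn(16, 100, 256)
content, pos, vis = layer(content, pos, feats)
print(pos.shape, vis.shape)  # torch.Size([4, 16, 2]) torch.Size([4, 16, 1])
```

Stacking several such layers and reading out `pos` and `visibility` from the last one mirrors the layer-by-layer query refinement the summary attributes to DETR-like designs.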