8 Jul 2024 | Skanda Koppula, Ignacio Rocco, Yi Yang, Joe Heyward, João Carreira, Andrew Zisserman, Gabriel Brostow, Carl Doersch
The paper introduces TAPVid-3D, a new benchmark for evaluating long-range Tracking Any Point in 3D (TAP-3D). Whereas 2D point tracking already has benchmarks built on real-world videos, TAPVid-3D is the first to address 3D point tracking in real-world scenarios. The benchmark features over 4,000 real-world videos from three data sources: Aria Digital Twin, DriveTrack, and Panoptic Studio, covering a wide range of object types, motion patterns, and environments. To measure performance, the authors formulate new metrics that extend the Jaccard-based metric used in 2D tracking to handle complexities such as ambiguous depth scales, occlusions, and multi-track spatio-temporal smoothness.
The benchmark includes ground-truth 3D trajectories and occlusion information, and the authors manually verify a large sample of trajectories to ensure accuracy. They also construct competitive baselines from existing tracking models to assess the current state of TAP-3D. The paper discusses the benchmark's limitations and ethical considerations and highlights potential applications in fields including robotic manipulation, video generation, and visual odometry.
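
As a rough illustration of how a Jaccard-style tracking metric might be carried over to 3D when the depth scale is ambiguous, the sketch below rescales the predicted tracks by a single global median factor and then counts true positives, false positives, and false negatives at a set of distance thresholds. The function names, the array layout (tracks, frames, xyz), the example thresholds, and the choice of global median rescaling are assumptions made here for illustration; this is not the paper's released evaluation code.

```python
import numpy as np


def median_rescale(pred, gt, gt_visible):
    """Rescale predictions by one global factor (hypothetical helper).

    The factor is the median ratio of ground-truth to predicted point norms
    over visible points, so a prediction that is correct up to a global
    scale is mapped back onto the ground truth.
    """
    pred_norm = np.linalg.norm(pred[gt_visible], axis=-1)
    gt_norm = np.linalg.norm(gt[gt_visible], axis=-1)
    scale = np.median(gt_norm / np.maximum(pred_norm, 1e-8))
    return pred * scale


def jaccard_3d(pred, gt, gt_visible, pred_visible, threshold):
    """Jaccard score TP / (TP + FP + FN) at a single distance threshold."""
    within = np.linalg.norm(pred - gt, axis=-1) < threshold
    # True positive: visible in both and close enough in 3D.
    tp = np.sum(gt_visible & pred_visible & within)
    # False positive: predicted visible, but occluded in truth or too far.
    fp = np.sum(pred_visible & ~(gt_visible & within))
    # False negative: truly visible, but predicted occluded or too far.
    fn = np.sum(gt_visible & ~(pred_visible & within))
    return tp / max(tp + fp + fn, 1)


def average_jaccard_3d(pred, gt, gt_visible, pred_visible,
                       thresholds=(0.01, 0.02, 0.04, 0.08, 0.16)):
    """Average the Jaccard score over thresholds (values are illustrative)."""
    pred = median_rescale(pred, gt, gt_visible)
    scores = [jaccard_3d(pred, gt, gt_visible, pred_visible, t)
              for t in thresholds]
    return float(np.mean(scores))


if __name__ == "__main__":
    # Two tracks over three frames; the prediction is off by a global scale.
    gt = np.ones((2, 3, 3))
    pred = 2.0 * gt
    visible = np.ones((2, 3), dtype=bool)
    print(average_jaccard_3d(pred, gt, visible, visible))
```

Because the rescaling uses one factor for the whole clip, a prediction that differs from the ground truth only by a global scale is not penalized, while per-point depth errors still are.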