TRAM: Global Trajectory and Motion of 3D Humans from in-the-wild Videos

2 Sep 2024 | Yufu Wang, Ziyun Wang, Lingjie Liu, and Kostas Daniilidis
TRAM is a two-stage method for reconstructing the global trajectory and motion of 3D humans from in-the-wild videos. The first stage addresses camera motion: TRAM robustifies Simultaneous Localization and Mapping (SLAM) to recover the camera trajectory in the presence of dynamic humans, and uses the scene background to derive the metric scale of that motion. The recovered camera thus provides a metric-scale reference frame in which the human trajectory can be estimated.

The second stage addresses kinematic body motion in the camera frame. For this, TRAM introduces VIMO, a video transformer model that builds on the large pre-trained HMR2.0 model. VIMO adds two temporal transformers that propagate temporal information in the image and motion domains, achieving state-of-the-art reconstruction accuracy.

By combining the two motions, TRAM reconstructs 3D humans in world space. This scene-centric approach, which uses the background to estimate metric-scale camera motion and reconstructs humans in the camera frame, is both efficient and accurate, reducing global trajectory error by a large margin compared to prior results. VIMO also outperforms prior models on multiple pose and shape reconstruction benchmarks.
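The composition step described above can be illustrated with a minimal sketch. Assuming per-frame camera-to-world poses from the robustified SLAM stage and per-frame body root poses in the camera frame from the kinematic stage (the function name and array layout here are hypothetical, not from the paper), the world-space trajectory is a per-frame rigid-transform composition:

```python
import numpy as np

def compose_global_trajectory(R_wc, t_wc, R_cb, t_cb):
    """Compose camera and body motion into a world-space trajectory.

    Hypothetical sketch of TRAM's two-stage combination, not the
    authors' implementation.

    R_wc, t_wc: camera-to-world rotations (T, 3, 3) and translations (T, 3),
                at metric scale, as recovered by the SLAM stage.
    R_cb, t_cb: body root pose in the camera frame, same shapes,
                as regressed by the kinematic stage (e.g. VIMO).
    Returns the body root pose in world coordinates.
    """
    # World-from-body = world-from-camera composed with camera-from-body.
    R_wb = R_wc @ R_cb                                  # (T, 3, 3)
    t_wb = np.einsum('tij,tj->ti', R_wc, t_cb) + t_wc   # (T, 3)
    return R_wb, t_wb
```

With a static identity camera the body trajectory passes through unchanged; once the camera moves, its metric-scale translation is added to the body's camera-frame position, which is why deriving the correct motion scale from the background is essential for an accurate global trajectory.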