Understanding RoHM: Robust Human Motion Reconstruction via Diffusion

RoHM is a diffusion-based approach for robust 3D human motion reconstruction from monocular RGB(-D) videos in the presence of noise and occlusions. Given incomplete and noisy motion estimates obtained from RGB or RGB-D video, it reconstructs smooth and complete 3D human motion. Diffusion models denoise and infill both the root trajectory in global space and the local motion in body-root space, for visible and occluded joints alike, and also predict whether the feet are in contact with the ground to improve physical plausibility. Compared to baselines such as HuMoR, RoHM reconstructs more plausible motions that faithfully match the image evidence, especially under heavy occlusions. The method decomposes the problem into two sub-tasks, global trajectory and local motion, and introduces a novel conditioning module that, combined with an iterative inference scheme, captures the correlations between the two. RoHM applies to several tasks, including motion reconstruction, denoising, and spatial and temporal infilling. Extensive experiments on three popular datasets show that RoHM outperforms state-of-the-art approaches qualitatively and quantitatively while being faster at test time. The code is available at https://sanweiliti.github.io/ROHM/ROHM.html.

RoHM uses off-the-shelf regressors and/or per-frame optimization to obtain an initial SMPL-X estimate for each frame. These estimates are noisy, inaccurate under occlusions, and temporally inconsistent. Given these estimates, the goal is to generate realistic motion in consistent global coordinates. The method learns two diffusion-based models, TrajNet and PoseNet, which reconstruct the global trajectory and the local motion, respectively. A flexible conditioning module, TrajControl, is introduced to refine motion plausibility. Test-time score guidance further improves the physical plausibility and accuracy of the reconstructed motions.

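To make the split into global trajectory and local motion concrete, here is a minimal sketch that is not taken from the RoHM code. It assumes the motion is given as world-space joint positions of shape (frames, joints, 3) and that joint 0 is the body root. The function names `decompose_motion` and `recompose_motion` are hypothetical. It handles root translation only, whereas a full body-root representation would presumably also factor out the root orientation.

```python
import numpy as np


def decompose_motion(joints_world, root_index=0):
    """Split world-space joints into a root trajectory and root-relative joints.

    joints_world: array of shape (T, J, 3).
    Returns (root_traj, local) with shapes (T, 3) and (T, J, 3).
    Assumption for illustration: only the root translation is removed.
    """
    root_traj = joints_world[:, root_index, :]
    # Broadcasting subtracts each frame's root position from all of its joints.
    local = joints_world - root_traj[:, None, :]
    return root_traj, local


def recompose_motion(root_traj, local):
    """Invert decompose_motion: place root-relative joints back in world space."""
    return local + root_traj[:, None, :]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    joints = rng.normal(size=(5, 4, 3))  # 5 frames, 4 joints (toy data)
    traj, local = decompose_motion(joints)
    print(traj.shape, local.shape)
    print(np.allclose(recompose_motion(traj, local), joints))
```

In a pipeline of the kind described above, one model would operate on the trajectory part and another on the local part. Keeping the two representations exactly invertible is what allows their outputs to be combined again in global coordinates.
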
Experiments on the AMASS, PROX, and EgoBody datasets show that RoHM achieves superior accuracy and realism compared to state-of-the-art optimization-based methods while being 30 times faster at inference time. The method is robust to noise and occlusions and performs well in challenging scenarios. It is trained with a curriculum learning scheme that gradually increases noise levels and occlusion rates as training progresses. Evaluation covers accuracy, physical plausibility, and foot-ground contact. The results show that RoHM significantly reduces foot skating and improves motion plausibility compared to baselines, with a significantly shorter run time than HuMoR.
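
A curriculum of the kind mentioned above, with noise levels and occlusion rates that grow during training, could look like the following sketch. It is an illustration under stated assumptions, not the authors' training code. The linear ramp, the maximum values `max_noise_std` and `max_occlusion_rate`, the zero-filling of occluded joints, and the function names are all hypothetical.

```python
import numpy as np


def curriculum_level(step, total_steps, start=0.0, end=1.0):
    """Linearly interpolate from start to end over training, clamped to [start, end]."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress


def corrupt_motion(joints, step, total_steps,
                   max_noise_std=0.05, max_occlusion_rate=0.5, rng=None):
    """Add Gaussian noise and random joint occlusions whose severity grows with step.

    joints: array of shape (T, J, 3).
    Returns (corrupted, visible), where visible is a boolean (T, J) mask.
    Assumption for illustration: occluded joints are zeroed out.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise_std = curriculum_level(step, total_steps, 0.0, max_noise_std)
    occlusion_rate = curriculum_level(step, total_steps, 0.0, max_occlusion_rate)
    noisy = joints + rng.normal(scale=noise_std, size=joints.shape)
    # A joint is visible when its uniform sample is at or above the occlusion rate.
    visible = rng.random(joints.shape[:2]) >= occlusion_rate
    corrupted = np.where(visible[..., None], noisy, 0.0)
    return corrupted, visible


if __name__ == "__main__":
    clean = np.ones((8, 4, 3))
    for step in (0, 50, 100):
        _, visible = corrupt_motion(clean, step, 100, rng=np.random.default_rng(0))
        print(step, "fraction of visible joints:", visible.mean())
```

Early in training the corrupted input is nearly identical to the clean motion, and by the final step it reaches the maximum noise and occlusion settings, so the model sees progressively harder reconstruction problems.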

RoHM: Robust Human Motion Reconstruction via Diffusion

15 Apr 2024 | Siwei Zhang, Bharat Lal Bhatnagar, Yuanlu Xu, Alexander Winkler, Petr Kadlec