The paper introduces TRUMANS (Tracking Human Actions in Scenes), a comprehensive motion-captured dataset for Human-Scene Interaction (HSI) modeling, featuring over 15 hours of diverse human interactions across 100 indoor scenes. The dataset captures whole-body human motions and part-level object dynamics, emphasizing realistic contact. To enhance the dataset, physical environments are digitally replicated into accurate virtual models, and extensive augmentations are applied to both humans and objects while maintaining interaction fidelity.
Based on TRUMANS, the authors propose a diffusion-based autoregressive model for generating HSI sequences of arbitrary length, taking into account scene context and intended actions. The model uses scene and action embeddings as conditions, allowing for efficient and controllable motion synthesis. Experiments demonstrate that the proposed method outperforms existing baselines in terms of quality and zero-shot generalizability on various 3D scene datasets, producing motions that closely mimic original motion-captured data.
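To make the autoregressive, condition-driven generation scheme concrete, the sketch below illustrates one plausible way such a model could be rolled out at inference time. This is not the authors' implementation: the module names, embedding dimensions, segment length, and the highly simplified diffusion loop are all assumptions, meant only to show how a conditional diffusion model can synthesize motion segment by segment, conditioning each segment on the scene embedding, the current action embedding, and the tail of the previously generated segment.

```python
import torch
import torch.nn as nn

# Illustrative sizes, not taken from the paper.
POSE_DIM, SCENE_DIM, ACTION_DIM = 63, 64, 16
SEG_LEN, OVERLAP = 32, 8  # frames per segment / tail frames carried over as history

class SegmentDenoiser(nn.Module):
    """Toy stand-in for the denoising network: maps a noised segment plus conditions to a clean segment."""
    def __init__(self, cond_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(POSE_DIM + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, POSE_DIM),
        )

    def forward(self, x_t, t, cond):
        # x_t: (B, T, POSE_DIM) noised motion, t: (B,) diffusion step, cond: (B, cond_dim)
        t_emb = t.float().view(-1, 1, 1).expand(-1, x_t.shape[1], 1)
        c = cond.unsqueeze(1).expand(-1, x_t.shape[1], -1)
        return self.net(torch.cat([x_t, c, t_emb], dim=-1))

@torch.no_grad()
def sample_segment(model, cond, steps=50):
    """Very simplified reverse-diffusion loop: refine Gaussian noise into one motion segment."""
    x = torch.randn(cond.shape[0], SEG_LEN, POSE_DIM)
    for t in reversed(range(steps)):
        t_batch = torch.full((cond.shape[0],), t)
        x0_hat = model(x, t_batch, cond)                # predicted clean segment
        x = x0_hat + (t / steps) * torch.randn_like(x)  # crude re-noising schedule (illustrative)
    return x

@torch.no_grad()
def generate_long_motion(model, scene_emb, action_embs):
    """Autoregressive rollout: each segment is conditioned on scene, action, and the previous tail."""
    batch = scene_emb.shape[0]
    prev_tail = torch.zeros(batch, OVERLAP, POSE_DIM)   # history frames that enforce continuity
    segments = []
    for action_emb in action_embs:                      # one embedding per intended action
        cond = torch.cat([scene_emb, action_emb, prev_tail.flatten(1)], dim=-1)
        seg = sample_segment(model, cond)
        segments.append(seg)
        prev_tail = seg[:, -OVERLAP:]
    return torch.cat(segments, dim=1)                   # arbitrary-length motion sequence

# Example usage with random embeddings (e.g. walk -> sit -> stand).
model = SegmentDenoiser(cond_dim=SCENE_DIM + ACTION_DIM + OVERLAP * POSE_DIM)
scene = torch.randn(1, SCENE_DIM)
actions = [torch.randn(1, ACTION_DIM) for _ in range(3)]
motion = generate_long_motion(model, scene, actions)    # shape: (1, 3 * SEG_LEN, POSE_DIM)
```

Conditioning each segment on the tail of the previous one is what lets the rollout extend to arbitrary length without losing temporal continuity; swapping the action embedding between segments is what makes the synthesis controllable.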
The paper also includes a detailed evaluation of the dataset and the proposed method, showing their effectiveness in both static and dynamic settings. Human studies further validate the realism of the synthesized motions, with participants struggling to distinguish them from real MoCap data. Additionally, the dataset's utility in image-based tasks, such as 3D human mesh estimation and contact estimation, is highlighted.