21 Mar 2024 | Peishan Cong, Ziyi Wang, Zhiyang Dou, Yiming Ren, Wei Yin, Kai Cheng, Yujing Sun, Xiaoxiao Long, Xinge Zhu, and Yuexin Ma
**LaserHuman** is a pioneering dataset designed to advance Scene-Text-to-Motion research. It contains large-scale sequences of rich human motions and interactions captured in real scenarios and paired with free-form language descriptions, providing valuable data for conditioned human motion generation. The dataset stands out for its genuine human motions recorded in 3D environments, unconstrained free-form natural language descriptions, mix of indoor and outdoor scenarios, and dynamic, changing scenes. Its diverse capture modalities and rich annotations open up broad opportunities for research on conditional motion generation.
To generate semantically consistent and physically plausible human motions, the authors propose a multi-conditional diffusion model. This model effectively integrates both scene and textual contexts, achieving state-of-the-art performance on existing datasets. The dataset and code will be released soon.
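As a rough illustration of what such a multi-conditional diffusion model involves, here is a minimal PyTorch sketch of a noise-prediction training step conditioned on scene and text features. The module names (`denoiser`, `scene_encoder`, `text_encoder`) and the DDPM-style schedule are assumptions made for illustration, not the authors' implementation.

```python
# Minimal sketch of a multi-conditional diffusion training step, assuming a
# DDPM-style noise-prediction objective; module names are illustrative only.
import torch
import torch.nn.functional as F

def diffusion_training_step(denoiser, scene_encoder, text_encoder,
                            motion, scene_points, text_tokens, num_steps=1000):
    """One step: predict the noise added to a clean motion sequence,
    conditioned on encoded scene and text features."""
    b = motion.shape[0]
    t = torch.randint(0, num_steps, (b,), device=motion.device)  # random timestep
    noise = torch.randn_like(motion)

    # Linear beta schedule -> cumulative alpha_bar for the forward process.
    betas = torch.linspace(1e-4, 0.02, num_steps, device=motion.device)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t].view(b, 1, 1)

    noisy_motion = alpha_bar.sqrt() * motion + (1.0 - alpha_bar).sqrt() * noise

    # Condition on both modalities.
    scene_feat = scene_encoder(scene_points)  # e.g. point-cloud features
    text_feat = text_encoder(text_tokens)     # e.g. language-model features

    pred_noise = denoiser(noisy_motion, t, scene_feat, text_feat)
    return F.mse_loss(pred_noise, noise)
```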
The introduction highlights the significance of generating realistic human motions from natural language descriptions in 3D scenes, which has applications in simulation, animation, VR/AR, and robotics. Previous research in human motion generation has primarily focused on a single condition, whereas this work tackles the harder problem of generating motions that are both semantically consistent with text descriptions and physically plausible within 3D scenes.
The LaserHuman dataset includes 11 diverse 3D scenes, 3,374 high-quality motion sequences, and 12,303 language descriptions. The scenes include 5 outdoor areas and 4 complexly designed first-floor halls, spanning over 2,000 square meters. The motion sequences capture interactions with both static and dynamic objects, and the language descriptions are detailed and diverse, covering a wide range of actions and interactions.
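To make the dataset's modalities concrete, the following hypothetical container shows how one sample could be organized. The field names and array shapes are assumptions for illustration, not the released data schema.

```python
# A hypothetical container for one LaserHuman sample, purely to illustrate the
# modalities described above; field names and shapes are assumptions.
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class LaserHumanSample:
    scene_points: np.ndarray   # (N, 3) point cloud of the 3D scene
    motion: np.ndarray         # (T, D) per-frame pose and translation parameters
    descriptions: List[str]    # free-form language descriptions of the motion
```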
The multi-condition fusion module in the diffusion-based generative model enhances the consistency of generated motions with both text instructions and 3D scenes. The module uses a self-attention layer and a parallel cross-attention mechanism to integrate scene and text conditions, improving the generation quality.
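The following PyTorch sketch shows one plausible form of such a fusion block, under the assumption that the parallel cross-attention outputs are combined by summation; the paper's exact layer sizes and fusion rule may differ.

```python
# Illustrative sketch of a multi-condition fusion block: self-attention over
# motion tokens, then parallel cross-attention to scene and text features.
import torch.nn as nn

class MultiConditionFusionBlock(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.scene_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, motion_tokens, scene_feat, text_feat):
        # Self-attention over the motion sequence.
        x = motion_tokens
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        # Parallel cross-attention to each condition, fused by summation
        # (an assumption; other fusion rules are possible).
        h = self.norm2(x)
        x = x + self.scene_attn(h, scene_feat, scene_feat)[0] \
              + self.text_attn(h, text_feat, text_feat)[0]
        return x + self.ffn(x)
```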
Experiments on the LaserHuman and HUMANISE datasets demonstrate the effectiveness of the proposed method, which achieves superior performance on metrics such as contact score, FID, R-score, and translation diversity. Qualitatively, the generated motions are more visually plausible and more semantically consistent with the language instructions.
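The summary does not spell out the metric formulas, so the following sketches are only assumed interpretations: a contact score based on a distance threshold between body vertices and scene points, and translation diversity as the mean pairwise distance between the root trajectories of generated samples.

```python
# Hedged sketches of two evaluation quantities; the paper's exact definitions
# may differ from these assumed forms.
import torch

def contact_score(body_verts, scene_points, thresh=0.05):
    """body_verts: (T, V, 3) motion vertices; scene_points: (N, 3).
    Fraction of body vertices within `thresh` of the scene."""
    d = torch.cdist(body_verts.reshape(-1, 3), scene_points)  # (T*V, N)
    return (d.min(dim=1).values < thresh).float().mean().item()

def translation_diversity(root_trajs):
    """root_trajs: (S, T, 3) global root translations of S generated samples
    (assumes S > 1). Mean pairwise distance between trajectories."""
    flat = root_trajs.reshape(root_trajs.shape[0], -1)
    pdist = torch.cdist(flat, flat)                            # (S, S)
    s = root_trajs.shape[0]
    return (pdist.sum() / (s * (s - 1))).item()
```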
The discussion section addresses failure cases and suggests future research directions, including physical refinement and the integration of diverse modalities for human motion generation.