21 Mar 2024 | Peishan Cong, Ziyi Wang, Zhiyang Dou, Yiming Ren, Wei Yin, Kai Cheng, Yujing Sun, Xiaoxiao Long, Xinge Zhu, and Yuexin Ma
LaserHuman is a new large-scale dataset for language-guided, scene-aware human motion generation in free environments. It includes real human motions captured in 3D scenes, free-form natural language descriptions, a mix of indoor and outdoor scenarios, and both static and dynamic environments. The data were collected with multiple calibrated and synchronized 128-beam LiDARs, RGB cameras, and wearable IMUs to capture both the scenes and the human motions. Annotators of different ages and genders provided detailed linguistic descriptions for the motion sequences. The dataset contains 11 diverse 3D scenes, 3,374 high-quality motion sequences, and 12,303 language descriptions.
The Scene-Text-to-Motion task is extremely challenging: the model must generate natural, realistic human motions that are semantically consistent with the language descriptions and physically plausible in their interactions with the 3D scenes. To tackle this, the authors propose a multi-conditional diffusion model that is simple yet effective, achieving state-of-the-art performance on existing datasets. The dataset and code will be released soon.
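To make the setup concrete, here is a minimal sketch of what one training step of a text- and scene-conditioned diffusion model typically looks like; the module and feature names (denoiser, scene_feat, text_feat) and the DDPM-style noising are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a scene- and text-conditioned diffusion training step.
# Names and signatures are hypothetical, chosen only for illustration.
import torch
import torch.nn.functional as F

def diffusion_training_step(denoiser, x0, scene_feat, text_feat, alphas_cumprod):
    """x0: clean motion sequence (B, T, D); scene_feat/text_feat: condition embeddings."""
    B = x0.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (B,), device=x0.device)
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(B, 1, 1)                 # cumulative noise schedule at step t
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise    # forward (noising) process
    # The denoiser predicts the added noise, guided by both conditions.
    pred_noise = denoiser(x_t, t, scene_feat, text_feat)
    return F.mse_loss(pred_noise, noise)
```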
The dataset provides multi-modal data, including videos, 3D scene maps, dynamic LiDAR point clouds, and global human motions for the target person and other participants. Each motion sequence comes with free-form language descriptions written by two different annotators, yielding diverse scenarios, rich interactions, and abundant free-form language descriptions.
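For illustration, one sample in such a dataset could be thought of as bundling the modalities listed above; the field names and shapes below are assumptions, not the released data format.

```python
# Illustrative sketch of a LaserHuman-style sample record;
# field names and shapes are assumptions, not the actual data format.
from dataclasses import dataclass
import numpy as np

@dataclass
class MotionSample:
    scene_points: np.ndarray         # static 3D scene map, (N, 3)
    lidar_sweeps: list[np.ndarray]   # dynamic LiDAR point clouds, one per frame
    video_frames: list[np.ndarray]   # synchronized RGB frames
    target_motion: np.ndarray        # global motion of the target person, (T, D)
    other_motions: list[np.ndarray]  # motions of other participants in the scene
    descriptions: list[str]          # two free-form language descriptions
```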
The dataset has four main novel characteristics: real human motions in 3D scenes, diverse human motion categories, diverse interactive environments, and free-form natural language descriptions. It is designed to facilitate research on both static and dynamic scene-conditioned motion generation.
For their diffusion-based generative model, the authors propose a multi-condition fusion module that effectively integrates scene and textual contexts. It combines a self-attention layer with a parallel cross-attention mechanism to learn feature interactions across modalities, fusing the two conditions before they guide the diffusion process.
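A minimal sketch of such a fusion block, assuming standard transformer attention, might look as follows; the layer sizes, residual wiring, and summation-based fusion are illustrative assumptions rather than the authors' exact architecture.

```python
# Sketch of a multi-condition fusion block: self-attention over motion tokens,
# plus two parallel cross-attention branches injecting scene and text context.
import torch
import torch.nn as nn

class MultiConditionFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_scene = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, motion_tokens, scene_tokens, text_tokens):
        # Self-attention over the (noisy) motion tokens.
        h, _ = self.self_attn(motion_tokens, motion_tokens, motion_tokens)
        # Parallel cross-attention into each condition modality.
        h_scene, _ = self.cross_scene(h, scene_tokens, scene_tokens)
        h_text, _ = self.cross_text(h, text_tokens, text_tokens)
        # Fuse the two conditioned streams and keep a residual path.
        return self.norm(motion_tokens + h_scene + h_text)
```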
The authors evaluate their method on LaserHuman and HUMANISE, achieving state-of-the-art performance. It outperforms other methods on contact scores, FID, R-score, and translation diversity, and shows greater consistency with language instructions for specific actions such as "grab hand."
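As a reminder of what FID measures here, it compares the feature distributions of generated and real motions under a Gaussian assumption, using the means and covariances of features from a pretrained motion encoder:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \mathrm{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)
```

where \((\mu_r, \Sigma_r)\) and \((\mu_g, \Sigma_g)\) are the mean and covariance of real and generated motion features, respectively; lower is better.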
The authors also discuss the limitations of their method, including the difficulty of achieving precise point-level contact and the need for further research on integrating constraints for both physical and dynamic interactions. They also highlight the potential of diverse modalities for human motion generation and the importance of scene-aware motion generation for animation and humanoid robots.