2 Apr 2024 | Juze Zhang, Jingyan Zhang, Zining Song, Zhanhe Shi, Chengfeng Zhao, Ye Shi, Jingyi Yu, Lan Xu, Jingya Wang
HOI-M^3: Capturing Multiple Humans and Objects Interaction within Contextual Environment
This paper introduces HOI-M^3, a novel large-scale dataset for modeling interactions between multiple humans and multiple objects. It contains 199 sequences and 181 million video frames recorded from 42 diverse viewpoints, covering a wide range of daily scenarios, and provides accurate 3D tracking for both humans and objects from dense RGB and object-mounted IMU inputs. The dataset is intended to facilitate various tasks related to human-object interaction perception and generation.
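To make the data layout concrete, below is a minimal sketch of how one per-sequence record combining the RGB and IMU modalities might be organized. The class and field names are hypothetical illustrations for this summary, not the dataset's published schema.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical per-sequence record; names are illustrative, not HOI-M^3's actual schema.
@dataclass
class HOIM3Sequence:
    sequence_id: str
    rgb_views: List[str]           # paths to the 42 synchronized 4K video streams
    object_imu_streams: List[str]  # one inertial stream per tracked object
    human_poses: str               # per-frame 3D human MoCap annotations
    object_poses: str              # per-frame 6-DoF object tracking labels
    num_frames: int = 0
```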
The HOI-M^3 dataset is the first of its kind, opening up the research direction of data-driven motion capture, and even synthesis, of multiple interacting humans and objects. Its rich annotations and multiple modalities also hold great potential for future work on HOI modeling and behavior analysis. Building on our novel HOI-M^3 dataset, we provide strong baseline methods for two novel downstream tasks: 1) monocular capture of multiple HOI; and 2) unstructured generation of multiple HOI. For the former, we introduce a novel single-shot learning-based method to estimate multi-person and multi-object 3D poses. For the latter, we tailor diffusion models to the generation of intricate social interactions.
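The paper summary does not spell out the generation baseline's implementation, so the following is only a minimal sketch of a standard DDPM-style denoising loop over joint human-object pose trajectories, under the assumption of a hypothetical `denoiser` network that predicts noise. All tensor shapes and names are illustrative.

```python
import torch

# Minimal DDPM-style sampler over joint human-object pose trajectories.
# `denoiser` is a hypothetical noise-prediction network; x has shape
# (batch, frames, dims), where dims stacks all person and object poses.
def sample_hoi_motion(denoiser, shape, timesteps=1000, device="cpu"):
    betas = torch.linspace(1e-4, 0.02, timesteps, device=device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape, device=device)  # start from pure Gaussian noise
    for t in reversed(range(timesteps)):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = denoiser(x, t_batch)  # predicted noise at step t
        # Posterior mean of the reverse process (standard DDPM update rule).
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x  # denoised multi-human, multi-object pose trajectories
```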
The HOI-M^3 dataset comprises 199 human-object interaction sequences, totaling 181 million frames, that cover 90 diverse 3D objects and 31 human subjects (20 males and 11 females) across various environments, with dense-view coverage at 4K resolution and 60 fps. Noteworthy features of our HOI-M^3 dataset include: 1) Multiple Humans and Objects: each sequence involves at least 2 persons and 5 objects, making it, to the best of our knowledge, the first real-world 3D multi-human, multi-object dataset with accurate 3D MoCap. 2) High Quality: sequences are recorded within daily-style rooms with 42 synchronized camera views, and inertial measurement units (IMUs) are embedded in each pre-scanned object to ensure accurate human-object tracking labels. 3) Large Size and Rich Modality: our dataset records over 20 hours of interactions with both RGB and inertial sensors, providing segmentation annotations, pre-scanned object geometry, and accurate HOI tracking labels.
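As a sanity check, the headline numbers reported above are mutually consistent: 42 synchronized views recording at 60 fps for the stated 20-plus hours yield roughly 181 million frames. The quick arithmetic below uses only figures given in the text; the exact per-sequence split is not reported.

```python
views = 42   # synchronized camera views
fps = 60     # frames per second per view
hours = 20   # "over 20 hours" of recorded interaction

total_frames = views * fps * 3600 * hours
print(f"{total_frames:,}")  # 181,440,000 -- consistent with the reported 181M frames
```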
In summary, the HOI-M^3 dataset is designed to capture multiple human-object interactions within a contextual environment, and its key advantages lie in its recording scale, annotation quality, and rich modality.