Semantic Scene Completion from a Single Depth Image


28 Nov 2016 | Shuran Song, Fisher Yu, Andy Zeng, Angel X. Chang, Manolis Savva, Thomas Funkhouser
This paper introduces SSCNet, a 3D convolutional neural network for semantic scene completion: from a single depth image, it jointly predicts volumetric occupancy and semantic labels, producing a complete 3D voxel representation of the scene that carries both geometric and semantic information. Previous approaches have addressed scene completion and semantic labeling separately, but the authors argue that the two tasks are closely coupled. Their network uses a dilation-based 3D context module to expand the receptive field efficiently and enable 3D context learning, and they train on a new large-scale synthetic dataset, SUNCG, with dense volumetric annotations.

Given a single depth image, the network extracts and aggregates both local geometric and contextual information with 3D convolutions, and outputs a probability distribution over occupancy and object categories for every voxel inside the camera view frustum.
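To make that input/output contract concrete, here is a minimal PyTorch sketch. The framework choice, layer stack, grid resolution, and 12-way output (11 object categories plus empty space) are illustrative assumptions, not the published architecture; the point is only the shape of the mapping from an encoded depth volume to per-voxel class scores over the view frustum.

```python
import torch
import torch.nn as nn

class SemanticSceneCompletionStub(nn.Module):
    """Minimal sketch of the network's interface, not the published
    architecture: an encoded depth volume in, per-voxel class scores out.
    The 12 outputs stand for 11 object categories plus empty space."""
    def __init__(self, num_classes=12):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=7, stride=2, padding=3),   # coarsen the grid
            nn.ReLU(inplace=True),
            nn.Conv3d(16, 32, kernel_size=3, stride=2, padding=1),  # coarsen again
            nn.ReLU(inplace=True),
        )
        self.classifier = nn.Conv3d(32, num_classes, kernel_size=1)  # per-voxel scores

    def forward(self, volume):                          # volume: (B, 1, 240, 144, 240)
        return self.classifier(self.features(volume))   # (B, 12, 60, 36, 60)

vol = torch.randn(1, 1, 240, 144, 240)       # encoded depth (e.g. a flipped TSDF)
logits = SemanticSceneCompletionStub()(vol)
probs = logits.softmax(dim=1)                # per-voxel occupancy/category distribution
```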
Two design decisions are central, and both are sketched in code below. First, the 3D space is encoded with a flipped TSDF, which gives the network a more meaningful signal for learning geometry and scene representation. Second, a dilation-based 3D context module captures higher-level inter-object contextual information.

The authors evaluate the method on both real and synthetic datasets, showing that the joint model outperforms methods that address each task in isolation, and that it significantly outperforms alternative approaches on the semantic scene completion task. The network accurately predicts both occupancy and semantic labels for a wide range of objects and scenes. The authors also discuss the value of synthetic training data and the benefits of multi-scale aggregation and larger receptive fields for capturing richer contextual information, concluding that understanding object semantics improves scene completion, and that scene completion in turn helps object recognition. The dataset, code, and pretrained model will be made available online upon acceptance.
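The summary does not spell out the flipped-TSDF formula, so the sketch below encodes one plausible reading: where a standard TSDF's magnitude grows with distance from the surface until truncation, the flipped encoding puts its largest magnitude on the surface and decays to zero at the truncation distance, so the strongest signal sits on the geometry itself. The sign convention (+1 for visible free space, -1 for occluded space, normally obtained by projecting voxels into the depth image), the normalization, and the use of SciPy's Euclidean distance transform are assumptions for illustration.

```python
import numpy as np
from scipy import ndimage

def flipped_tsdf(surface, sign, d_max=4.0):
    """Sketch of a flipped-TSDF encoding (exact normalization is an
    assumption). `surface` is a boolean voxel grid of observed surface
    voxels; `sign` is +1 for visible free space and -1 for occluded space.
    distance_transform_edt gives each voxel's distance (in voxels) to the
    nearest surface voxel."""
    d = ndimage.distance_transform_edt(~surface)  # unsigned distance to surface
    d = np.minimum(d, d_max)                      # truncate at d_max voxels
    return sign * (1.0 - d / d_max)               # magnitude 1 on surfaces, 0 at truncation

# Toy example: a single surface voxel in a small grid.
surface = np.zeros((8, 8, 8), dtype=bool)
surface[4, 4, 4] = True
sign = np.ones((8, 8, 8), dtype=np.float32)      # pretend all space is visible
enc = flipped_tsdf(surface, sign)
print(enc[4, 4, 4], enc[0, 0, 0])                # 1.0 on the surface, 0.0 far away
```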
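The dilation-based context module can likewise be sketched as a stack of 3D convolutions with growing dilation rates whose outputs are concatenated: the receptive field expands without further downsampling, and the concatenation performs the multi-scale aggregation the authors credit with capturing richer context. The specific rates and channel widths here are assumptions, not the published configuration.

```python
import torch
import torch.nn as nn

class DilatedContextModule3D(nn.Module):
    """Sketch of a dilation-based 3D context module: stacked 3D convolutions
    with increasing dilation rates enlarge the receptive field while keeping
    the spatial resolution, and features from all dilation levels are
    concatenated for multi-scale aggregation."""
    def __init__(self, channels=32, dilations=(1, 2, 4)):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(
                # padding == dilation keeps the spatial size fixed for k=3
                nn.Conv3d(channels, channels, kernel_size=3,
                          padding=d, dilation=d),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        ])

    def forward(self, x):
        outs, feat = [], x
        for stage in self.stages:       # each stage sees an ever larger context
            feat = stage(feat)
            outs.append(feat)
        return torch.cat(outs, dim=1)   # multi-scale concatenation

x = torch.randn(1, 32, 60, 36, 60)      # features on the downsampled output grid
y = DilatedContextModule3D()(x)
print(y.shape)                          # torch.Size([1, 96, 60, 36, 60])
```

Because padding equals dilation for the 3x3x3 kernels, every stage preserves the spatial grid, so context grows purely through dilation rather than resolution loss.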