Context and Geometry Aware Voxel Transformer for Semantic Scene Completion

22 May 2024 | Zhu Yu¹, Runming Zhang¹, Jiacheng Ying¹, Junchen Yu¹, Xiaohai Hu³, Lun Luo², Siyuan Cao¹, Huiliang Shen¹*
This paper proposes a novel context and geometry aware voxel transformer (CGVT) for semantic scene completion (SSC) and builds it into a complete network named CGFormer. CGFormer achieves state-of-the-art performance on the SemanticKITTI and SSCBench-KITTI-360 benchmarks, attaining mIoU scores of 16.87 and 20.05 and IoU scores of 45.99 and 48.07, respectively, and it even outperforms approaches that rely on temporal image inputs or much larger image backbone networks. Ablation studies show that the CGVT and LGE components are crucial to this performance, and qualitative comparisons confirm that CGFormer produces more accurate scene completions than competing methods.

CGVT improves semantic scene completion in two ways. First, it dynamically generates distinct voxel queries from the content of each individual input image, so the queries capture the unique characteristics of that image and give the attention layers a good starting point. Second, it extends deformable cross-attention from 2D to 3D pixel space.
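As a rough illustration of content-dependent query initialization, the sketch below lifts 2D image features into a camera-frustum volume using an estimated per-pixel depth distribution (an LSS-style outer product), so the resulting voxel queries depend on the specific input image rather than on embeddings shared across all samples. The module and tensor names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ContextAwareQueryInit(nn.Module):
    """Hypothetical sketch: seed voxel queries from image content.

    Assumes an LSS-style lift (per-pixel depth distribution x per-pixel
    context feature -> camera-frustum volume); names are illustrative.
    """
    def __init__(self, in_ch=256, embed_dim=128, n_depth_bins=64):
        super().__init__()
        self.depth_head = nn.Conv2d(in_ch, n_depth_bins, kernel_size=1)  # depth logits per pixel
        self.feat_head = nn.Conv2d(in_ch, embed_dim, kernel_size=1)      # context feature per pixel

    def forward(self, img_feat):
        # img_feat: (B, C, H, W) 2D backbone feature map of one input image
        depth_prob = self.depth_head(img_feat).softmax(dim=1)   # (B, D, H, W)
        context = self.feat_head(img_feat)                      # (B, E, H, W)
        # Outer product along the depth axis lifts each pixel feature into a
        # camera-frustum volume; pooling/splatting this volume into the voxel
        # grid yields queries that differ for every input image.
        frustum = depth_prob.unsqueeze(1) * context.unsqueeze(2)  # (B, E, D, H, W)
        return frustum

# Usage: a 2D feature map in, a content-dependent frustum volume out.
x = torch.randn(1, 256, 24, 80)
queries = ContextAwareQueryInit()(x)   # shape (1, 128, 64, 24, 80)
```

In CGFormer, such content-dependent queries then feed the deformable cross-attention layers, which sample features in 3D rather than 2D pixel space.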
To improve the accuracy of the estimated depth probability, CGFormer further introduces a simple yet efficient depth refinement module that adds minimal computational burden. To boost its semantic and geometric representation abilities, it encodes the lifted 3D volume with multiple representations, a voxel volume and tri-perspective view (TPV) planes, capturing the scene from both local and global perspectives.
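The following sketch shows one plausible way to combine a local voxel branch with a global TPV branch: the voxel volume is pooled along each axis to form three orthogonal planes, each plane is refined in 2D, and the planes are broadcast back into 3D and fused with a 3D-convolutional local branch. This mirrors the stated idea of local-plus-global encoding under assumed layer choices; it is not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class VoxelTPVEncoder(nn.Module):
    """Hypothetical sketch: fuse a local voxel branch with a global TPV branch."""
    def __init__(self, ch=128):
        super().__init__()
        self.local = nn.Conv3d(ch, ch, 3, padding=1)        # local voxel branch
        self.plane_xy = nn.Conv2d(ch, ch, 3, padding=1)     # global TPV branches
        self.plane_xz = nn.Conv2d(ch, ch, 3, padding=1)
        self.plane_yz = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, vox):
        # vox: (B, C, X, Y, Z) voxel feature volume
        local = self.local(vox)
        xy = self.plane_xy(vox.mean(dim=4))                 # pool over Z -> (B, C, X, Y)
        xz = self.plane_xz(vox.mean(dim=3))                 # pool over Y -> (B, C, X, Z)
        yz = self.plane_yz(vox.mean(dim=2))                 # pool over X -> (B, C, Y, Z)
        # Broadcast each refined plane back to 3D and add the global context
        # to the local voxel features.
        glob = xy.unsqueeze(-1) + xz.unsqueeze(3) + yz.unsqueeze(2)
        return local + glob

# Usage: a voxel volume in, a locally and globally encoded volume out.
v = torch.randn(1, 128, 32, 32, 4)
out = VoxelTPVEncoder()(v)   # shape (1, 128, 32, 32, 4)
```

The broadcast-and-add fusion here is only one option; concatenation or attention-based fusion would serve the same purpose of injecting global TPV context into the local voxel features.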