Context and Geometry Aware Voxel Transformer for Semantic Scene Completion


22 May 2024 | Zhu Yu1, Runming Zhang1, Jiacheng Ying1, Junchen Yu1, Xiaohai Hu3, Lun Luo2, Siyuan Cao1, Huiliang Shen1*
The paper introduces CGFormer, a novel neural network for Semantic Scene Completion (SSC) that addresses the limitations of existing sparse-to-dense approaches. CGFormer uses a context and geometry aware voxel transformer (CGVT) to improve the performance of SSC. The CGVT initializes context-dependent queries tailored to individual input images, capturing their unique characteristics and aggregating information within the region of interest. It extends deformable cross-attention from 2D to 3D pixel space, enabling the differentiation of points with similar image coordinates based on their depth coordinates. CGFormer also incorporates multiple 3D representations (voxel and tri-perspective view, TPV) to enhance the semantic and geometric representation abilities of the transformed 3D volume from both local and global perspectives.
Experimental results on the SemanticKITTI and SSCBench-KITTI-360 benchmarks demonstrate that CGFormer achieves state-of-the-art performance, outperforming methods using temporal images or larger image backbone networks. The code for the proposed method is available at <https://github.com/pkqbajng/CGFormer>.
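The key idea described above, extending deformable cross-attention into 3D pixel space so that points sharing image coordinates are distinguished by depth, can be illustrated with a minimal sketch. This is a hypothetical simplification, not the authors' implementation: it uses numpy, a single query and head, and nearest-neighbour sampling in place of trilinear interpolation; all function and variable names are assumptions.

```python
import numpy as np

def deformable_cross_attention_3d(feat_volume, ref_point, offsets, attn_logits):
    """Sketch of 3D deformable cross-attention (single query, single head).

    feat_volume: (D, H, W, C) image features lifted into 3D pixel space,
                 indexed by (depth bin, row, column)
    ref_point:   (3,) reference coordinates (d, h, w) of the query
    offsets:     (K, 3) learned sampling offsets around the reference
    attn_logits: (K,) unnormalised attention weights for the K samples
    """
    D, H, W, C = feat_volume.shape
    # Softmax over the K sampling points.
    w = np.exp(attn_logits - attn_logits.max())
    w = w / w.sum()
    out = np.zeros(C)
    for k in range(len(offsets)):
        # Nearest-neighbour sample at the offset location, clamped to bounds.
        d, h, x = np.clip(np.round(ref_point + offsets[k]).astype(int),
                          0, [D - 1, H - 1, W - 1])
        out += w[k] * feat_volume[d, h, x]
    return out

# Two queries sharing the same image coordinates (h, w) but different depth
# bins attend to different features -- the differentiation the paper describes.
fv = np.zeros((4, 2, 2, 1))
for d in range(4):
    fv[d] = d  # feature value encodes the depth bin
zero_offsets = np.zeros((2, 3))
zero_logits = np.zeros(2)
near = deformable_cross_attention_3d(fv, np.array([0.0, 0.0, 0.0]),
                                     zero_offsets, zero_logits)
far = deformable_cross_attention_3d(fv, np.array([3.0, 0.0, 0.0]),
                                    zero_offsets, zero_logits)
```

With purely 2D deformable attention the two queries would collapse onto the same (h, w) samples; adding the depth coordinate keeps them distinct.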