27 May 2024 | Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, Jiwen Lu
**GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction**
**Authors:** Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, and Jiwen Lu
**Institutions:** Tsinghua University, University of California, Berkeley, PhiGent Robotics
**Abstract:**
This paper addresses the challenge of 3D semantic occupancy prediction, which aims to predict the fine-grained geometry and semantics of the surrounding scene for robust autonomous driving. Traditional methods often use dense grids like voxels, which ignore the sparsity of occupancy and the diversity of object scales, leading to inefficient resource allocation. To tackle this, the authors propose an object-centric 3D semantic Gaussian representation, where each Gaussian represents a flexible region of interest with its mean, covariance, and semantic features. The GaussianFormer model uses sparse convolution and cross-attention to transform 2D images into 3D Gaussian representations. An efficient Gaussian-to-voxel splatting module generates dense 3D occupancy predictions by aggregating neighboring Gaussians. Extensive experiments on the nuScenes and KITTI-360 datasets show that GaussianFormer achieves performance comparable to state-of-the-art methods while reducing memory consumption by 75.2%–82.2%.
**Keywords:** 3D occupancy prediction, 3D Gaussian splatting, Autonomous Driving
**Introduction:**
The paper discusses the limitations of existing methods, such as dense voxel representations and bird's-eye-view (BEV) representations, which suffer from redundancy and loss of detail. The proposed 3D Gaussian representation adaptively describes regions of interest, allowing for better resource allocation and efficiency. The authors also introduce a Gaussian-to-voxel splatting method, which efficiently generates 3D occupancy predictions from the 3D Gaussians.
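As a rough illustration of the representation described above (not the paper's implementation; the function name and parameterization are assumptions), a semantic 3D Gaussian can be described by a mean, per-axis scales, a rotation, and per-class semantic logits, with the covariance factored as Σ = R S Sᵀ Rᵀ as in 3D Gaussian splatting:

```python
import numpy as np

def gaussian_density(x, mean, scale, rotation):
    """Evaluate an anisotropic 3D Gaussian at point x (unnormalized).

    The covariance is factored as Sigma = R S S^T R^T, where R is a
    3x3 rotation matrix and S = diag(scale), following the common
    3D Gaussian splatting parameterization.
    """
    S = np.diag(scale)
    cov = rotation @ S @ S.T @ rotation.T
    d = x - mean
    return float(np.exp(-0.5 * d @ np.linalg.inv(cov) @ d))

# A semantic Gaussian additionally carries per-class logits, e.g.:
# {"mean": ..., "scale": ..., "rotation": ..., "logits": ...}
```

The density peaks at the mean (value 1 for this unnormalized form) and decays anisotropically according to the learned scales and rotation, which is what lets each Gaussian cover a flexibly sized and oriented region of interest.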
**Related Work:**
The paper reviews existing methods for 3D semantic occupancy prediction, including voxel-based methods, BEV-based methods, and 3D Gaussian splatting methods. It highlights the advantages and limitations of each approach, emphasizing the need for a more flexible and efficient representation.
**Proposed Approach:**
The paper details the GaussianFormer model, which includes self-encoding, image cross-attention, and property refinement modules. The model learns meaningful 3D Gaussians from multi-view images and refines their properties iteratively. The Gaussian-to-voxel splatting module efficiently generates 3D occupancy predictions using local aggregation.
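The local-aggregation idea behind Gaussian-to-voxel splatting can be sketched as follows. This is a naive illustrative loop, not the paper's CUDA implementation; the function name, data layout, and cutoff radius are assumptions:

```python
import numpy as np

def splat_to_voxels(gaussians, grid_coords, radius=2.0):
    """Aggregate semantic 3D Gaussians into dense voxel labels.

    For each voxel center, sum the semantic logits of nearby Gaussians
    weighted by their density at that point (Gaussians beyond `radius`
    are skipped, i.e. local aggregation); the argmax over classes
    gives the voxel's predicted semantic label.
    """
    num_classes = gaussians[0]["logits"].shape[0]
    occupancy = np.zeros((len(grid_coords), num_classes))
    for g in gaussians:
        cov = g["rotation"] @ np.diag(g["scale"] ** 2) @ g["rotation"].T
        inv_cov = np.linalg.inv(cov)
        for i, x in enumerate(grid_coords):
            d = x - g["mean"]
            if np.dot(d, d) > radius ** 2:  # outside the local neighborhood
                continue
            w = np.exp(-0.5 * d @ inv_cov @ d)  # Gaussian density weight
            occupancy[i] += w * g["logits"]
    return occupancy.argmax(axis=1)
```

Restricting each Gaussian's contribution to nearby voxels is what makes the splatting efficient: unlike a dense decoder, the cost scales with the number of Gaussians and their local footprints rather than with the full voxel grid.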
**Experiments:**
The paper evaluates GaussianFormer on the nuScenes and KITTI-360 datasets, demonstrating its effectiveness and efficiency. The results show that GaussianFormer achieves comparable performance to state-of-the-art methods while significantly reducing memory consumption.
**Conclusion:**
The paper concludes by discussing the limitations of GaussianFormer and suggesting future directions for improvement. The authors highlight the potential of 3D Gaussians in capturing fine details and efficiently allocating resources, making them a promising scene representation for vision-based 3D semantic occupancy prediction.