GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction


27 May 2024 | Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, and Jiwen Lu
GaussianFormer is a method for 3D semantic occupancy prediction built on a sparse 3D Gaussian representation. The scene is described by a set of 3D semantic Gaussians, each representing a flexible region of interest through its mean, covariance, and semantic features. The model transforms 2D images into 3D Gaussian representations using sparse convolution and cross-attention, and a Gaussian-to-voxel splatting module efficiently generates dense 3D occupancy predictions by aggregating the Gaussians neighboring each voxel.

GaussianFormer achieves performance comparable to state-of-the-art methods with only 17.8%–24.8% of their memory consumption. It is evaluated on the nuScenes and KITTI-360 datasets, demonstrating effective 3D semantic occupancy prediction from surround-view and monocular cameras, and its efficiency and accuracy make it suitable for autonomous driving applications. The object-centric representation adapts flexibly to varying object scales and region complexities, improving resource allocation and efficiency. The model is trained end-to-end with cross-entropy and Lovász-softmax losses, achieving high performance with fewer Gaussians and reduced computational overhead.

Results show that GaussianFormer produces realistic and holistic scene perceptions, capturing fine details of object shapes while allocating computation and storage efficiently. Its limitations include slightly lower performance than state-of-the-art methods and the need for a large number of Gaussians to reach satisfactory results. Overall, GaussianFormer offers an efficient and effective solution for 3D semantic occupancy prediction in autonomous driving.
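To make the Gaussian-to-voxel splatting idea concrete, here is a minimal NumPy sketch. It is an illustration only, not the paper's optimized CUDA implementation: the function name `splat_gaussians`, the neighborhood `radius`, and the array shapes are all assumptions for this example. Each voxel accumulates the semantic features of nearby Gaussians, weighted by each Gaussian's density at the voxel center.

```python
import numpy as np

def splat_gaussians(means, covs, semantics, voxel_centers, radius=2.0):
    """Aggregate Gaussian contributions into per-voxel semantic logits.

    means:         (G, 3) Gaussian centers
    covs:          (G, 3, 3) covariance matrices
    semantics:     (G, C) per-Gaussian semantic features
    voxel_centers: (V, 3) voxel center coordinates
    Only Gaussians within `radius` of a voxel contribute, mimicking the
    sparse neighborhood aggregation described in the summary.
    """
    V, C = voxel_centers.shape[0], semantics.shape[1]
    out = np.zeros((V, C))
    inv_covs = np.linalg.inv(covs)
    for g in range(len(means)):
        d = voxel_centers - means[g]                  # (V, 3) offsets to this Gaussian
        near = np.linalg.norm(d, axis=1) < radius     # neighborhood mask
        if not near.any():
            continue
        dn = d[near]
        # Gaussian density weight exp(-0.5 * d^T Sigma^{-1} d)
        w = np.exp(-0.5 * np.einsum('vi,ij,vj->v', dn, inv_covs[g], dn))
        out[near] += w[:, None] * semantics[g]        # weighted semantic splat
    return out
```

A per-voxel argmax over the resulting semantic channels would then yield the discrete occupancy labels; the sparsity comes from each voxel touching only its nearby Gaussians rather than the full set.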