OccGen: Generative Multi-modal 3D Occupancy Prediction for Autonomous Driving


23 Apr 2024 | Guoqing Wang, Zhongdao Wang, Pin Tang, Jilai Zheng, Xiangxuan Ren, Bailan Feng, and Chao Ma
**Institution:** MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University; Huawei Noah's Ark Lab

**Project Page:** <https://occgen-ad.github.io/>

**Abstract:** This paper introduces OccGen, a generative perception model for 3D semantic occupancy prediction. Unlike discriminative methods that map inputs to occupancy maps in a single step, OccGen adopts a "noise-to-occupancy" paradigm, progressively refining the occupancy map by predicting and removing noise starting from a random Gaussian sample. OccGen consists of a conditional encoder and a progressive refinement decoder: the conditional encoder processes multi-modal inputs, while the decoder refines the occupancy map through diffusion denoising. Extensive experiments on multiple benchmarks demonstrate the effectiveness of OccGen, showing improvements of 9.5%, 6.3%, and 13.3% in mIoU on nuScenes-Occupancy under the multi-modal, LiDAR-only, and camera-only settings, respectively. OccGen also exhibits desirable properties such as uncertainty estimation and progressive inference.

**Keywords:** Occupancy, Generative Model, Diffusion, Multi-modal

**Introduction:** 3D semantic occupancy prediction is crucial for autonomous driving systems. Existing methods often treat the task as a one-shot 3D voxel-wise segmentation problem and therefore cannot refine the occupancy map gradually. OccGen addresses this with a generative approach, which models the coarse-to-fine refinement of the dense 3D occupancy map more effectively.

**Method:** OccGen's generative pipeline consists of a conditional encoder and a progressive refinement decoder. The encoder processes multi-modal inputs, while the decoder refines the occupancy map through diffusion denoising. The diffusion process naturally models coarse-to-fine refinement, leading to more detailed predictions.

**Experiments:** OccGen is evaluated on the nuScenes-Occupancy and SemanticKITTI datasets. Results show that OccGen outperforms state-of-the-art methods, achieving significant improvements in mIoU. Ablation studies and qualitative results further validate the effectiveness of OccGen's components.

**Conclusion:** OccGen is a powerful generative model for 3D semantic occupancy prediction, offering improved performance and desirable properties such as uncertainty estimation and progressive inference.
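The "noise-to-occupancy" refinement described above can be sketched as a standard DDPM-style reverse diffusion loop over a voxel volume. This is a minimal, hypothetical illustration, not the authors' implementation: the `predict_noise` placeholder, the toy conditioning tensor, the grid size, and the noise schedule are all assumptions; in OccGen the noise predictor would be the progressive refinement decoder conditioned on fused camera/LiDAR features from the conditional encoder.

```python
import numpy as np

def predict_noise(x_t, t, cond):
    # Placeholder denoiser (assumption): a real model would predict the
    # Gaussian noise present in x_t at step t, conditioned on sensor features.
    return 0.1 * x_t + 0.01 * cond

def ddpm_refine(cond, shape=(4, 4, 4), steps=10, seed=0):
    """Noise-to-occupancy sketch: start from pure Gaussian noise and
    iteratively denoise toward a voxel-wise occupancy logit volume."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, steps)   # linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    x = rng.standard_normal(shape)           # x_T ~ N(0, I)
    for t in reversed(range(steps)):
        eps = predict_noise(x, t, cond)      # predicted noise at step t
        # DDPM posterior mean update
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                            # re-inject noise except at the last step
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x                                 # refined occupancy logits

occ_logits = ddpm_refine(cond=np.ones((4, 4, 4)))
occ_map = occ_logits > 0.0                   # threshold logits to binary occupancy
```

Because each run starts from a different noise sample, drawing multiple samples (different seeds) yields a distribution over occupancy maps, which is one way the generative formulation can support the uncertainty estimation and progressive (fewer- or more-step) inference mentioned in the abstract.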