**MonoOcc: Digging into Monocular Semantic Occupancy Prediction**
13 Mar 2024 | Yupeng Zheng1,2*, Xiang Li3*, Pengfei Li3, Yuhang Zheng3, Bu Jin1,2, Chengliang Zhong3, Xiaoxiao Long4†, Hao Zhao3, and Qichao Zhang1,2†
Monocular Semantic Occupancy Prediction aims to infer the complete 3D geometry and semantic information of scenes from 2D images, which is crucial for enhancing 3D perception in autonomous vehicles. However, existing methods often rely on complex cascaded frameworks with limited information, leading to suboptimal performance, especially for small and long-tailed objects. To address these issues, the authors propose MonoOcc, which includes several key innovations:
1. **Image-Conditioned Cross-Attention Module**: This module refines voxel features by incorporating visual clues from image features, improving the accuracy of depth estimation (a minimal cross-attention sketch follows this list).
2. **2D Semantic Auxiliary Loss**: This loss function provides deep supervision to the shallow layers of the framework, facilitating better optimization (a sketch of such an auxiliary loss also appears after the list).
3. **Privileged Branch**: This branch uses a pre-trained large image backbone and a Cross View Transformer to enhance temporal view features, providing richer visual information.
4. **Distillation Module**: This module transfers knowledge from the privileged branch to the monocular branch, improving performance while maintaining efficiency (a feature-distillation sketch appears after the list as well).
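For a concrete picture of the cross-attention step, here is a minimal PyTorch sketch, not the authors' implementation: the module name, feature dimensions, and head count are illustrative assumptions. Flattened voxel features act as queries and attend to flattened 2D image features, so visual clues from the image can refine the voxel representation.

```python
# Minimal sketch (assumptions: module name, dims, and head count are illustrative).
import torch
import torch.nn as nn

class ImageConditionedCrossAttention(nn.Module):
    def __init__(self, dim: int = 128, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, voxel_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # voxel_feats: (B, N_voxels, C) flattened 3D voxel features used as queries
        # image_feats: (B, N_pixels, C) flattened 2D image features used as keys/values
        refined, _ = self.attn(query=voxel_feats, key=image_feats, value=image_feats)
        # Residual connection preserves the original geometric features.
        return self.norm(voxel_feats + refined)

# Toy usage with random tensors.
voxels = torch.randn(2, 1024, 128)   # flattened voxel grid
pixels = torch.randn(2, 4096, 128)   # flattened image feature map
print(ImageConditionedCrossAttention()(voxels, pixels).shape)  # torch.Size([2, 1024, 128])
```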
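The 2D semantic auxiliary loss can likewise be sketched as a lightweight segmentation head attached to shallow image features and supervised with cross-entropy; the head design and class count below are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch of a 2D semantic auxiliary loss (assumed head design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticAuxLoss(nn.Module):
    def __init__(self, in_channels: int = 128, num_classes: int = 20):
        super().__init__()
        # 1x1 convolution mapping shallow image features to per-class logits.
        self.head = nn.Conv2d(in_channels, num_classes, kernel_size=1)

    def forward(self, feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # feats:  (B, C, H, W) shallow image features from the 2D backbone
        # labels: (B, H_img, W_img) per-pixel semantic labels, 255 = ignore
        logits = self.head(feats)
        logits = F.interpolate(logits, size=labels.shape[-2:],
                               mode="bilinear", align_corners=False)
        # Deep supervision: the gradient reaches the shallow layers directly.
        return F.cross_entropy(logits, labels, ignore_index=255)
```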
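Finally, the transfer from the privileged branch (large pre-trained backbone plus temporal views, used only during training) to the monocular branch can be sketched as a feature-level distillation loss. The projection layer and MSE objective below are an assumed formulation, not necessarily the paper's exact loss.

```python
# Minimal sketch of feature-level distillation (assumed formulation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDistillationLoss(nn.Module):
    def __init__(self, student_dim: int = 128, teacher_dim: int = 256):
        super().__init__()
        # Project monocular (student) features to the privileged (teacher) width.
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_feats: torch.Tensor, teacher_feats: torch.Tensor) -> torch.Tensor:
        # student_feats: (B, N, C_s) from the monocular branch
        # teacher_feats: (B, N, C_t) from the privileged branch; detached so no
        # gradients flow back into the teacher during distillation
        return F.mse_loss(self.proj(student_feats), teacher_feats.detach())
```

Because only the monocular branch runs at inference time, this setup lets the student benefit from the richer privileged features during training while keeping deployment efficiency.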
The proposed method achieves state-of-the-art performance on the SemanticKITTI benchmark, outperforming existing methods in terms of mIoU, particularly for small and long-tailed objects. The authors also provide detailed experimental results and ablation studies to validate the effectiveness of each component.