MonoOcc: Digging into Monocular Semantic Occupancy Prediction
2024 | Yupeng Zheng, Xiang Li, Pengfei Li, Yuhang Zheng, Bu Jin, Chengliang Zhong, Xiaoxiao Long, Hao Zhao, and Qichao Zhang
MonoOcc is a monocular semantic occupancy prediction framework that infers complete 3D geometry and semantics from 2D images. It addresses a limitation of existing approaches, which rely on complex cascaded frameworks processing limited information and therefore perform poorly on small and long-tailed objects.

MonoOcc improves the monocular branch in two ways: an image-conditioned cross-attention module refines voxel features with visual cues from the image, and an auxiliary semantic loss supervises the shallow layers of the framework. In addition, a privileged branch pre-trains a larger image backbone and applies a cross-view transformer to enhance features from temporal views; a distillation module then transfers this temporal information and the richer backbone knowledge to the monocular branch at low hardware cost, balancing performance and efficiency. Sketches of both components follow below.

With these components, MonoOcc achieves state-of-the-art performance on the SemanticKITTI benchmark, with significant mIoU gains, particularly on small and long-tailed object classes, demonstrating its effectiveness for autonomous driving applications.
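To make the image-conditioned refinement concrete, here is a minimal PyTorch sketch of a cross-attention block in which flattened voxel features act as queries over 2D image features. The module name, feature dimensions, and residual/FFN layout are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ImageConditionedCrossAttention(nn.Module):
    """Sketch: voxel queries attend to 2D image features.

    Shapes and hyperparameters (d_model, n_heads, the flattening
    scheme) are assumptions for illustration only.
    """
    def __init__(self, d_model=128, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, voxel_feats, img_feats):
        # voxel_feats: (B, N_voxels, C) flattened 3D voxel features (queries)
        # img_feats:   (B, N_pixels, C) flattened 2D image features (keys/values)
        refined, _ = self.attn(voxel_feats, img_feats, img_feats)
        x = self.norm1(voxel_feats + refined)   # residual + norm
        return self.norm2(x + self.ffn(x))      # feed-forward refinement
```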
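The privileged-to-monocular transfer can likewise be sketched as a standard two-term distillation objective: feature imitation against the privileged branch's temporally fused voxel features plus soft-label distillation on per-voxel semantic logits. The specific loss terms, temperature, and weighting below are assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_feats, teacher_feats,
                      student_logits, teacher_logits,
                      T=2.0, alpha=0.5):
    """Hedged sketch of cross-branch distillation.

    The MSE + KL combination, temperature T, and weight alpha are
    illustrative choices; the teacher (privileged branch) is frozen
    via detach() so gradients only update the monocular student.
    """
    # Feature imitation: student voxel features mimic the teacher's.
    feat_loss = F.mse_loss(student_feats, teacher_feats.detach())
    # Soft-label distillation on per-voxel semantic logits.
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits.detach() / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * feat_loss + (1 - alpha) * kl
```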