20 Feb 2024 | Simon Boeder, Fabian Gigengack, Benjamin Risse
OccFlowNet is an approach to 3D occupancy estimation that is trained with only 2D supervision, which is easier to obtain than 3D voxel labels. Inspired by Neural Radiance Fields (NeRF), the method uses differentiable volumetric rendering to render depth and semantic maps from the occupancy predicted from camera images, and supervises these maps with 2D labels obtained from semantic LiDAR points. It introduces temporal rendering of adjacent time steps to improve geometric accuracy, and occupancy flow as a mechanism to handle dynamic objects and keep them temporally consistent. 3D labels can optionally be integrated for improved performance.

Extensive experiments on the Occ3D-nuScenes dataset show that OccFlowNet achieves state-of-the-art performance using only 2D supervision, outperforming previous 2D approaches, and that it significantly surpasses all existing occupancy estimation models when 2D and 3D supervision are combined. Occupancy flow and temporal rendering improve performance, especially for dynamic classes, which shows that rendering supervision can effectively train models for 3D semantic occupancy prediction even in the presence of dynamic objects.

The approach bridges the gap between 3D and 2D supervised methods: 2D supervision alone is sufficient for effective 3D occupancy estimation, which reduces the need for expensive 3D voxel labels and marks a significant step towards self-supervised, purely vision-based methods. Future work includes exploring self-supervised learning with vision-based methods for acquiring 2D labels, and jointly optimizing occupancy flow with volume rendering to become independent of labeled 3D boxes.
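
The rendering supervision described in the summary can be illustrated with the standard NeRF-style compositing rule, in which per-sample densities along a camera ray are turned into weights and used to average the sample depths and class scores. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `render_ray`, the per-ray list layout, and the use of per-sample class probabilities are all hypothetical choices made for illustration.

```python
import math


def render_ray(densities, class_probs, depths):
    """Composite samples along one ray into a depth and a class distribution.

    densities:   non-negative density per sample (sigma_i), hypothetical input
    class_probs: per-sample list of class probabilities (one list per sample)
    depths:      increasing sample distances along the ray (t_i)
    Returns (rendered_depth, rendered_class_scores, accumulated_weight).
    """
    n = len(densities)
    num_classes = len(class_probs[0])
    transmittance = 1.0  # T_i: fraction of the ray that reaches sample i
    rendered_depth = 0.0
    rendered_classes = [0.0] * num_classes
    weight_sum = 0.0

    for i in range(n):
        # Spacing to the next sample; the last sample reuses the previous spacing.
        if i + 1 < n:
            delta = depths[i + 1] - depths[i]
        elif n > 1:
            delta = depths[i] - depths[i - 1]
        else:
            delta = 1.0

        alpha = 1.0 - math.exp(-densities[i] * delta)  # opacity of sample i
        weight = transmittance * alpha                 # w_i = T_i * alpha_i

        rendered_depth += weight * depths[i]
        for c in range(num_classes):
            rendered_classes[c] += weight * class_probs[i][c]

        weight_sum += weight
        transmittance *= 1.0 - alpha                   # T_{i+1} = T_i * (1 - alpha_i)

    return rendered_depth, rendered_classes, weight_sum


# Usage: a ray with one nearly opaque sample at distance 3.0 belonging to class 1.
depth, classes, acc = render_ray(
    densities=[0.0, 0.0, 50.0, 0.0],
    class_probs=[[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
    depths=[1.0, 2.0, 3.0, 4.0],
)
```

In a training setup of the kind the summary describes, the rendered depth and class scores for each ray would be compared against the depth and semantic label of the LiDAR point projected onto that ray, so that gradients flow back into the predicted 3D volume. Because every step here is a smooth function of the densities, the same computation is differentiable when written in an autodiff framework.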