20 Feb 2024 | Simon Boeder, Fabian Gigengack, Benjamin Risse
**OccFlowNet: Towards Self-supervised Occupancy Estimation via Differentiable Rendering and Occupancy Flow**
This paper introduces OccFlowNet, a novel approach to occupancy estimation that leverages differentiable volumetric rendering, temporal rendering, and occupancy flow to achieve state-of-the-art performance among methods trained with only 2D supervision. The approach is inspired by Neural Radiance Fields (NeRF) and addresses a key limitation of existing methods: their reliance on large, costly datasets with fine-grained 3D voxel labels.
**Key Contributions:**
1. **Differentiable Volume Rendering:** OccFlowNet uses differentiable volumetric rendering to predict 2D depth and semantic maps from the 3D occupancy, eliminating the need for 3D voxel labels (see the rendering equations after this list).
2. **Temporal Rendering:** To improve geometric accuracy and strengthen the supervisory signal, the method additionally renders temporally adjacent time steps.
3. **Occupancy Flow:** This mechanism handles dynamic objects by enforcing temporal consistency of their occupancy, improving performance in the presence of moving objects.
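To make the first contribution concrete, the following is the standard NeRF-style volume rendering formulation that such a 2D-supervised setup typically relies on; this is a sketch of the general technique, and the paper's exact parameterization and losses may differ. For samples $i = 1, \dots, N$ along a camera ray $\mathbf{r}$, with predicted density $\sigma_i$, semantic logits $\mathbf{s}_i$, sample depth $t_i$, and sample spacing $\delta_i$:

$$
\alpha_i = 1 - \exp(-\sigma_i \delta_i), \qquad T_i = \prod_{j<i} \left(1 - \alpha_j\right),
$$

$$
\hat{D}(\mathbf{r}) = \sum_{i} T_i \, \alpha_i \, t_i, \qquad \hat{S}(\mathbf{r}) = \sum_{i} T_i \, \alpha_i \, \mathbf{s}_i .
$$

Because $\hat{D}$ and $\hat{S}$ are differentiable in the per-voxel predictions, 2D depth and semantic losses can be backpropagated into the 3D occupancy volume without any voxel-level ground truth.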
**Methodology:**
- **Problem Definition:** The goal is to estimate the semantic voxel volume using multi-view images.
- **Model Architecture:** The model transforms input images into semantic occupancy predictions using a combination of 3D CNNs and MLPs.
- **Training:** The model can be trained without 3D voxel labels using 2D supervision, and can also integrate 3D labels if available.
- **Temporal Rendering:** Additional rays are rendered from temporally adjacent frames to increase the supervisory signal and improve depth estimation (see the ray-generation sketch after this list).
- **Occupancy Flow:** This mechanism moves the estimated occupancy of dynamic objects to their correct positions in the adjacent frames, so that moving objects receive consistent supervision (see the warping sketch after this list).
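A minimal sketch of the temporal-rendering idea, assuming rays are generated in a temporally adjacent ego frame and then expressed in the current frame via the known relative ego pose (function names and array layouts here are hypothetical, not the paper's API):

```python
import numpy as np

def rays_to_current_frame(origins_adj, dirs_adj, T_adj_to_cur):
    """Express rays cast from an adjacent time step in the current frame.

    origins_adj  : (N, 3) ray origins in the adjacent ego frame
    dirs_adj     : (N, 3) unit ray directions in the adjacent ego frame
    T_adj_to_cur : (4, 4) rigid transform from the adjacent ego frame to the
                   current ego frame (composed from the known ego poses)
    """
    rot, trans = T_adj_to_cur[:3, :3], T_adj_to_cur[:3, 3]
    origins_cur = origins_adj @ rot.T + trans  # points rotate and translate
    dirs_cur = dirs_adj @ rot.T                # directions only rotate
    return origins_cur, dirs_cur
```

The transformed rays can then be rendered against the occupancy volume predicted for the current time step, adding supervision from viewpoints the current cameras did not cover.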
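And a minimal sketch of the occupancy-flow idea, assuming per-object 3D bounding-box poses are available at each time step (again, names and data layout are assumptions for illustration): voxels belonging to a dynamic object are rigidly moved from their position at time t to the object's position in the adjacent frame.

```python
import numpy as np

def warp_object_voxels(voxel_centers, in_box_mask, box_pose_t, box_pose_adj):
    """Move the voxels of one dynamic object from time t to an adjacent frame.

    voxel_centers : (N, 3) voxel center coordinates at time t
    in_box_mask   : (N,) boolean mask of voxels inside the object's box at t
    box_pose_t    : (4, 4) object-to-world pose at time t
    box_pose_adj  : (4, 4) object-to-world pose at the adjacent time step
    """
    warped = voxel_centers.copy()
    pts = voxel_centers[in_box_mask]                               # (M, 3)
    pts_h = np.concatenate([pts, np.ones((len(pts), 1))], axis=1)  # (M, 4)
    # World -> object frame at t, then object frame -> world at t'.
    relative = box_pose_adj @ np.linalg.inv(box_pose_t)
    warped[in_box_mask] = (pts_h @ relative.T)[:, :3]
    return warped
```

Static voxels stay where they are; only dynamic objects are warped before rendering the adjacent frame's rays, which keeps moving objects consistent with that frame's 2D targets.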
**Experiments:**
- **Dataset:** OccFlowNet is evaluated on the Occ3D-nuScenes benchmark, which provides semantic 3D voxel-based ground truth.
- **Results:** OccFlowNet achieves performance competitive with 3D-supervised models and outperforms concurrent 2D-supervised methods; combining 2D and 3D supervision improves results further.
**Conclusion:**
OccFlowNet advances occupancy estimation by narrowing the gap between 3D- and 2D-supervised methods, moving toward purely vision-based, self-supervised learning. Future work will explore fully self-supervised training and the integration of additional features such as color estimation and photometric losses.