15 Jan 2019 | Chen Wang, Danfei Xu, Yuke Zhu, Roberto Martín-Martín, Cewu Lu, Li Fei-Fei, Silvio Savarese
**DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion**
**Authors:** Chen Wang, Danfei Xu, Yuke Zhu, Roberto Martín-Martín, Cewu Lu, Li Fei-Fei, Silvio Savarese
**Institution:** Department of Computer Science, Stanford University; Department of Computer Science, Shanghai Jiao Tong University
**Abstract:**
This paper presents DenseFusion, a generic framework for estimating the 6D pose of known objects from RGB-D images. DenseFusion is designed to fully exploit both the RGB and depth data sources, addressing the challenges of heavy occlusion and the latency demands of real-time applications. The framework pairs a heterogeneous architecture that processes RGB and depth data separately with a dense fusion network that extracts pixel-wise dense feature embeddings, from which poses are estimated. An end-to-end iterative pose refinement procedure further improves accuracy while maintaining near real-time inference. Experiments on the YCB-Video and LineMOD datasets demonstrate that DenseFusion outperforms state-of-the-art methods in both accuracy and speed. The method is also deployed on a real robot for grasping and manipulation tasks.
**Key Contributions:**
- **Dense Fusion:** A principled approach to combining color and depth information from RGB-D inputs at the per-pixel level, enabling the model to reason jointly about local appearance and geometry.
- **Iterative Refinement:** A pose refinement module integrated directly into the neural network architecture, improving accuracy while maintaining real-time inference speed and eliminating the need for expensive post-processing steps.
**Methods:**
1. **Semantic Segmentation:** Segments each known object in the RGB image with an encoder-decoder network; the resulting mask selects the image crop and the depth points passed to the later stages.
2. **Dense Feature Extraction:** Extracts color features from the cropped image with a CNN and geometric features from the masked point cloud with a PointNet-like network.
3. **Pixel-wise Dense Fusion:** Fuses the color and geometric features at each pixel, augmented with a pooled global feature, to produce dense per-point embeddings (see the first sketch after this list).
4. **Pose Estimation:** Predicts a 6D pose (rotation and translation) plus a self-confidence score from every dense embedding, and keeps the most confident prediction.
5. **Iterative Refinement:** A refinement network repeatedly predicts a residual pose from the point cloud re-expressed in the current estimate, replacing ICP-style post-processing (see the second sketch below).
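To make steps 3 and 4 concrete, here is a minimal PyTorch-style sketch of pixel-wise fusion with per-point pose and confidence heads. All layer sizes, tensor shapes, and names (`DenseFusionSketch`, `color_emb`, `geo_emb`) are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseFusionSketch(nn.Module):
    """Illustrative sketch of pixel-wise dense fusion (not the official model).

    Inputs:
      color_emb: (B, N, Dc) per-pixel color embeddings from a CNN,
                 sampled at the N segmented depth pixels.
      geo_emb:   (B, N, Dg) per-point geometric embeddings from a
                 PointNet-like network over the masked point cloud.
    """
    def __init__(self, dc=32, dg=32, d_global=256):
        super().__init__()
        # Per-point MLP whose pooled output serves as a global feature.
        self.point_mlp = nn.Sequential(
            nn.Linear(dc + dg, 128), nn.ReLU(),
            nn.Linear(128, d_global), nn.ReLU(),
        )
        fused_dim = dc + dg + d_global
        self.rot_head = nn.Linear(fused_dim, 4)    # quaternion
        self.trans_head = nn.Linear(fused_dim, 3)  # translation
        self.conf_head = nn.Linear(fused_dim, 1)   # self-confidence

    def forward(self, color_emb, geo_emb):
        pixel_feat = torch.cat([color_emb, geo_emb], dim=-1)      # (B, N, Dc+Dg)
        # Global context via symmetric (max) pooling over all points.
        global_feat = self.point_mlp(pixel_feat).max(dim=1, keepdim=True).values
        global_feat = global_feat.expand(-1, pixel_feat.size(1), -1)
        fused = torch.cat([pixel_feat, global_feat], dim=-1)      # dense embedding
        quat = F.normalize(self.rot_head(fused), dim=-1)
        trans = self.trans_head(fused)
        conf = torch.sigmoid(self.conf_head(fused)).squeeze(-1)   # (B, N)
        # Keep the prediction of the most confident point per object.
        best = conf.argmax(dim=1)
        idx = torch.arange(quat.size(0), device=conf.device)
        return quat[idx, best], trans[idx, best], conf
```

The per-point confidence lets the network down-weight ambiguous pixels; in the paper this confidence is learned without direct supervision via a regularizer on the pose loss.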
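Step 5 can be pictured as a small network that predicts a residual pose from the observed points re-expressed in the current estimated frame, applied for a few iterations. The sketch below assumes a placeholder `refiner` module and a standard residual-composition convention; both are illustrative, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def quat_to_mat(q):
    """Minimal quaternion (w, x, y, z) -> rotation matrix, batched."""
    w, x, y, z = q.unbind(-1)
    return torch.stack([
        torch.stack([1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],     -1),
        torch.stack([2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],     -1),
        torch.stack([2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)], -1),
    ], -2)

def refine(points, color_emb, quat, trans, refiner, k=2):
    """Apply k refinement iterations.

    `refiner` is a placeholder network mapping (canonicalized points,
    color features) -> residual (quaternion, translation).
    """
    R, t = quat_to_mat(quat), trans
    for _ in range(k):
        # Re-express observed points in the currently estimated object frame.
        canon = torch.einsum('bij,bnj->bni', R.transpose(1, 2),
                             points - t.unsqueeze(1))
        dq, dt = refiner(canon, color_emb)
        dR = quat_to_mat(F.normalize(dq, dim=-1))
        # Compose the residual: p = R (dR p_model + dt) + t.
        t = t + torch.einsum('bij,bj->bi', R, dt)
        R = R @ dR
    return R, t
```

Because the refiner is a differentiable network rather than ICP, refinement trains end-to-end and adds only a small constant cost at inference.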
**Experiments:**
- **YCB-Video Dataset:** Evaluates performance on objects of varying shape and texture under different degrees of occlusion.
- **LineMOD Dataset:** Compares against state-of-the-art RGB-D methods on a standard benchmark of texture-poor objects.
- **Robotic Grasping:** Uses the estimated poses for grasping on a real robot to test real-world accuracy. Both datasets are typically scored with the ADD/ADD-S pose-error metrics sketched below.
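For reference, a minimal sketch of the standard ADD and ADD-S metrics commonly used on these benchmarks: ADD averages the distance between corresponding model points under the predicted and ground-truth poses, while ADD-S handles symmetric objects by matching each point to its closest counterpart. Function names and the single-object, unbatched shapes here are illustrative.

```python
import torch

def add_metric(pred_R, pred_t, gt_R, gt_t, model_pts):
    """ADD: mean distance between corresponding model points
    transformed by the predicted vs. ground-truth pose.
    model_pts: (M, 3); pred_R, gt_R: (3, 3); pred_t, gt_t: (3,)."""
    pred = model_pts @ pred_R.T + pred_t
    gt = model_pts @ gt_R.T + gt_t
    return (pred - gt).norm(dim=-1).mean()

def add_s_metric(pred_R, pred_t, gt_R, gt_t, model_pts):
    """ADD-S: for symmetric objects, match each transformed point to its
    closest ground-truth point before averaging."""
    pred = model_pts @ pred_R.T + pred_t
    gt = model_pts @ gt_R.T + gt_t
    dists = torch.cdist(pred, gt)          # (M, M) pairwise distances
    return dists.min(dim=1).values.mean()
```

A pose is usually counted correct when this error falls below a fraction of the object diameter (LineMOD) or summarized as the area under the accuracy-threshold curve (YCB-Video).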
**Results:**
- DenseFusion outperforms state-of-the-art methods on both datasets.
- The iterative refinement module significantly improves pose estimation accuracy, and the method degrades more gracefully than prior work under heavy occlusion.
- Because refinement is learned rather than ICP-based, inference is substantially faster than existing RGB-D pipelines, making the method suitable for real-time applications.
**Conclusion:**
DenseFusion provides a robust and efficient solution for 6D object pose estimation from RGB-D images, demonstrating superior performance in various challenging scenarios, including heavy occlusion, while running at near real-time speed.