**GaussianGrasper: 3D Language Gaussian Splatting for Open-vocabulary Robotic Grasping**
**Abstract:**
This paper presents GaussianGrasper, a novel approach to open-world robotic grasping guided by natural language instructions. The method reconstructs a 3D scene with 3D Gaussian Splatting, representing the scene as a collection of Gaussian primitives. From a limited set of RGB-D views, it builds a feature field using a tile-based splatting technique. An Efficient Feature Distillation (EFD) module employs contrastive learning to efficiently distill language embeddings from foundation models. The reconstructed geometry of the Gaussian field enables a pre-trained grasping model to generate collision-free grasp-pose candidates, and a normal-guided grasp module selects the best pose based on rendered normals. Comprehensive real-world experiments demonstrate that GaussianGrasper enables robots to accurately query and grasp objects from language instructions, providing a new solution for language-guided manipulation tasks.
**Contributions:**
- Introduces GaussianGrasper, a robot manipulation system with open-vocabulary semantics and accurate geometry.
- Proposes EFD to efficiently distill CLIP features and augment feature fields with SAM segmentation priors.
- Proposes a normal-guided grasp module to filter out infeasible grasp poses.
- Demonstrates zero-shot generalization for manipulation tasks in multiple real-world household tabletop scenes.
**Related Work:**
- Discusses existing methods for grasp pose detection and 3D feature field reconstruction, highlighting their limitations in terms of localization accuracy, data requirements, and scene adaptability.
**Methodology:**
- **3D Gaussian Splatting:** Initializes 3D Gaussian primitives from Structure-from-Motion (SfM) points and renders them into images via tile-based splatting (a compositing sketch follows this list).
- **Efficient Feature Distillation:** Extracts dense CLIP features and distills them into the 3D Gaussian field using contrastive learning with SAM segmentation priors (see the loss sketch below).
- **Language-guided Robotic Manipulation:** Locates queried objects with open-vocabulary text embeddings, generates grasp poses, filters them with rendered normals, and updates the scene after manipulation (see the query-and-filter sketch below).
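
The rendering step follows the standard 3DGS blending rule. Below is a minimal sketch of per-pixel, front-to-back alpha compositing over depth-sorted Gaussians; the function name and inputs are illustrative, and the actual renderer is tile-based and GPU-parallel. The same compositing can render RGB, distilled features, depth, or normals by swapping the per-Gaussian attribute being blended.

```python
# Minimal sketch of the 3DGS blending rule: C = sum_i c_i * a_i * prod_{j<i} (1 - a_j).
# Names and inputs are illustrative, not the paper's implementation.
import numpy as np

def composite_pixel(colors, alphas):
    """Front-to-back alpha compositing of depth-sorted Gaussian contributions."""
    out = np.zeros(3)
    transmittance = 1.0               # light not yet absorbed by nearer Gaussians
    for c, a in zip(colors, alphas):  # contributions sorted near to far
        out += transmittance * a * c
        transmittance *= 1.0 - a
        if transmittance < 1e-4:      # early termination once the pixel saturates
            break
    return out
```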
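For the EFD step, the following is a minimal sketch of mask-based contrastive distillation under stated assumptions; the paper's exact loss may differ. SAM masks group pixels into segments, and rendered features pooled inside each mask are pulled toward that mask's pooled CLIP embedding and pushed away from other masks' embeddings, InfoNCE-style.

```python
# Hedged sketch of contrastive distillation with SAM mask priors.
# Tensor shapes and the pooling strategy are assumptions for illustration.
import torch
import torch.nn.functional as F

def efd_contrastive_loss(rendered_feats, clip_feats, masks, tau=0.07):
    """
    rendered_feats: (H, W, D) features splatted from the Gaussian field
    clip_feats:     (H, W, D) dense CLIP features from the 2D encoder
    masks:          list of (H, W) boolean SAM masks
    """
    anchors, targets = [], []
    for m in masks:
        anchors.append(rendered_feats[m].mean(dim=0))  # pooled rendered feature
        targets.append(clip_feats[m].mean(dim=0))      # pooled CLIP feature
    a = F.normalize(torch.stack(anchors), dim=-1)      # (M, D)
    t = F.normalize(torch.stack(targets), dim=-1)      # (M, D)
    logits = a @ t.T / tau                             # mask-to-mask similarity
    labels = torch.arange(len(masks))                  # matching mask is the positive
    return F.cross_entropy(logits, labels)
```

Pooling per mask keeps the loss at one term per segment rather than one per pixel, which is one plausible reason such distillation stays cheap; whether this is the paper's exact mechanism is an assumption here.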
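Finally, a minimal query-and-filter sketch for the manipulation step. The text-encoding interface, feature-map layout, and grasp-pose format are assumptions; the idea is cosine similarity between a CLIP text embedding and the rendered feature field to localize the object, then rejecting grasp candidates whose approach axis deviates too far from the rendered surface normal.

```python
# Hedged sketch of open-vocabulary querying and normal-guided grasp filtering.
import torch
import torch.nn.functional as F

def query_object_mask(rendered_feats, text_emb, thresh=0.5):
    """rendered_feats: (H, W, D); text_emb: (D,) CLIP embedding of the query."""
    sim = F.cosine_similarity(rendered_feats, text_emb[None, None, :], dim=-1)
    return sim > thresh  # boolean mask over pixels matching the query

def filter_grasps_by_normal(approach_dirs, normals, max_deg=30.0):
    """Keep grasps whose approach direction opposes the rendered normal.
    approach_dirs, normals: (N, 3); the 30-degree threshold is illustrative."""
    cos = F.cosine_similarity(F.normalize(approach_dirs, dim=-1),
                              -F.normalize(normals, dim=-1), dim=-1)
    return cos > torch.cos(torch.deg2rad(torch.tensor(max_deg)))
```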
**Experiments:**
- Conducts experiments to validate the effectiveness of the EFD module, geometry reconstruction, and normal-guided grasp.
- Demonstrates successful language-guided grasping and efficient scene updating.
**Limitations:**
- The reconstructed scene remains static between updates and cannot account for scene changes that are not re-observed.
- Fails to estimate the depth and normals of transparent objects due to the lack of ground truth.
**Conclusion:**
GaussianGrasper effectively addresses the challenge of open-world robotic grasping guided by natural language instructions, offering a robust and efficient solution for language-guided manipulation tasks.