GaussianGrasper: 3D Language Gaussian Splatting for Open-vocabulary Robotic Grasping

14 Mar 2024 | Yuhang Zheng, Xiangyu Chen, Yupeng Zheng, Songen Gu, Runyi Yang, Bu Jin, Pengfei Li, Chengliang Zhong, Zengmao Wang, Lina Liu, Chao Yang, Dawei Wang, Zhen Chen, Xiaoxiao Long*, Meiqing Wang*
**Abstract:** This paper presents GaussianGrasper, a novel approach to open-world robotic grasping guided by natural-language instructions. The method reconstructs a 3D scene with 3D Gaussian Splatting, which represents the scene as a collection of Gaussian primitives. From a limited set of RGB-D views, it builds a feature field using a tile-based splatting technique. An Efficient Feature Distillation (EFD) module employs contrastive learning to efficiently distill language embeddings from foundation models into the field. The reconstructed geometry of the Gaussian field enables a pre-trained grasping model to generate collision-free grasp-pose candidates, and a normal-guided grasp module selects the best pose based on rendered surface normals. Comprehensive real-world experiments demonstrate that GaussianGrasper enables robots to accurately query and grasp objects from language instructions, providing a new solution for language-guided manipulation tasks.

**Contributions:**
- Introduces GaussianGrasper, a robot manipulation system with open-vocabulary semantics and accurate geometry.
- Proposes EFD to efficiently distill CLIP features and augment the feature field with SAM segmentation priors.
- Proposes a normal-guided grasp module to filter out infeasible grasp poses.
- Demonstrates zero-shot generalization for manipulation tasks across multiple real-world household tabletop scenes.

**Related Work:** Discusses existing methods for grasp-pose detection and 3D feature-field reconstruction, highlighting their limitations in localization accuracy, data requirements, and scene adaptability.

**Methodology:**
- **3D Gaussian Splatting:** Initializes 3D Gaussian primitives using Structure from Motion (SfM) and renders them into images.
- **Efficient Feature Distillation:** Extracts dense CLIP features and distills them into the 3D Gaussian field via contrastive learning (see the distillation sketch below).
- **Language-guided Robotic Manipulation:** Locates objects using open-vocabulary queries, generates grasp poses, and updates the scene after manipulation (see the query and grasp-filtering sketch below).
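To make the EFD step concrete, here is a minimal sketch of one way a contrastive loss can distill per-mask CLIP embeddings into rendered per-pixel features, using SAM masks to define positives and negatives. The function name, tensor shapes, and temperature are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def distill_loss(rendered_feats, mask_ids, clip_per_mask, tau=0.07):
    """Contrastive distillation sketch (hypothetical, not the paper's exact loss).

    rendered_feats: (N, D) per-pixel features rendered from the Gaussian field
    mask_ids:       (N,)   SAM mask index assigned to each pixel
    clip_per_mask:  (M, D) CLIP embedding pooled over each SAM mask
    """
    feats = F.normalize(rendered_feats, dim=-1)
    targets = F.normalize(clip_per_mask, dim=-1)
    # Cosine similarity of every pixel feature to every mask embedding.
    logits = feats @ targets.T / tau  # (N, M)
    # Each pixel's positive is the CLIP embedding of its own SAM mask;
    # the embeddings of all other masks act as negatives.
    return F.cross_entropy(logits, mask_ids)
```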
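The open-vocabulary query and the normal-guided grasp filter can likewise be sketched as follows, assuming a rendered per-pixel feature map and rendered surface normals are available. The similarity threshold and angular tolerance are hypothetical values chosen for illustration.

```python
import numpy as np

def localize(feature_map, text_emb, thresh=0.25):
    """Open-vocabulary query sketch: cosine similarity between the rendered
    per-pixel feature map (H, W, D) and a CLIP text embedding (D,)."""
    f = feature_map / np.linalg.norm(feature_map, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    sim = f @ t                    # (H, W) relevancy map
    return sim > thresh            # binary mask of the queried object

def filter_grasps(approach_dirs, normals, max_deg=30.0):
    """Normal-guided filtering sketch: keep grasp candidates whose approach
    axis is roughly antiparallel to the rendered surface normal.

    approach_dirs: (K, 3) unit approach directions of grasp candidates
    normals:       (K, 3) unit surface normals rendered at the contact points
    """
    cos = np.einsum('kd,kd->k', approach_dirs, -normals)
    return cos > np.cos(np.radians(max_deg))  # boolean keep-mask
```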
**Experiments:** Validates the effectiveness of the EFD module, the geometry reconstruction, and the normal-guided grasp module, and demonstrates successful language-guided grasping with efficient scene updating.

**Limitations:**
- The reconstructed scene remains static and cannot account for scene changes that are not re-observed.
- Fails to estimate the depth and normals of transparent objects due to the lack of ground truth.

**Conclusion:** GaussianGrasper effectively addresses the challenge of open-world robotic grasping guided by natural-language instructions, offering a robust and efficient solution for language-guided manipulation tasks.