Towards Open-World Grasping with Large Vision-Language Models

13 Oct 2024 | Georgios Tziafas, Hamidreza Kasaei
This paper introduces OWG, a system for open-world grasping that integrates large vision-language models (VLMs) with segmentation and grasp synthesis models, combining high-level semantic reasoning with low-level physical-geometric reasoning to enable zero-shot grasping in open-ended environments. OWG operates in three stages: open-ended referring segmentation, grounded grasp planning, and grasp ranking via contact reasoning. Throughout, visual prompts overlaid on the image are used to ground the language instruction in the scene and to guide grasp generation, as sketched in the code below.

The approach is evaluated on cluttered indoor scenes, where it outperforms previous supervised and zero-shot LLM-based methods. In both simulation and real-robot experiments, OWG shows robust grounding from open-ended language and higher grasp success rates than existing approaches. The paper highlights the potential of VLMs for open-world grasping, while also identifying limitations of current models, notably their difficulty with complex object relationships and fine-grained contact reasoning, and points to improved grounding and contact reasoning as directions for future work in robotic grasping.
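To make the three-stage flow concrete, here is a minimal Python sketch of how such a pipeline could be orchestrated. It is an illustration under stated assumptions, not the authors' implementation: the `VLM`, `Segmenter`, and `GraspModel` interfaces, the prompt wording, and the `draw_numbered_*` overlay helpers are all hypothetical placeholders.

```python
# Hypothetical sketch of a three-stage VLM grasping pipeline in the
# spirit of OWG. All interfaces, prompts, and helpers below are
# illustrative assumptions, not the paper's actual code or API.
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Grasp:
    pose: tuple    # e.g. a 6-DoF gripper pose
    score: float   # quality score from the grasp synthesis model


class VLM(Protocol):
    def query(self, image, prompt: str) -> str: ...


class Segmenter(Protocol):
    def segment(self, image) -> list: ...  # one mask per detected object


class GraspModel(Protocol):
    def synthesize(self, depth, mask) -> list[Grasp]: ...


def draw_numbered_marks(image, masks):
    """Placeholder: overlay a numeric ID on each mask (the visual prompt)."""
    return image


def draw_numbered_grasps(image, grasps):
    """Placeholder: overlay numbered grasp candidates on the image."""
    return image


def open_world_grasp(rgb, depth, instruction: str, vlm: VLM,
                     segmenter: Segmenter, grasp_model: GraspModel):
    # Stage 1 -- open-ended referring segmentation: segment everything,
    # then visually prompt the VLM to ground the instruction to one mask ID.
    masks = segmenter.segment(rgb)
    marked = draw_numbered_marks(rgb, masks)
    target_id = int(vlm.query(
        marked,
        f"Which numbered object matches '{instruction}'? Reply with its number."))

    # Stage 2 -- grounded grasp planning: ask whether the target can be
    # grasped directly or whether occluding objects must be removed first.
    plan = vlm.query(
        marked,
        f"Can object {target_id} be grasped directly, or must another "
        "object be moved out of the way first? If so, which one?")

    # Stage 3 -- grasp ranking via contact reasoning: generate candidates
    # on the target mask, overlay them, and let the VLM pick the best one.
    candidates = grasp_model.synthesize(depth, masks[target_id])
    overlay = draw_numbered_grasps(rgb, candidates)
    best = int(vlm.query(
        overlay,
        "Which numbered grasp makes the most stable, collision-free contact?"))
    return plan, candidates[best]
```

One plausible rationale for this division of labor, consistent with the paper's framing, is that the VLM only selects among numbered visual prompts rather than outputting coordinates, keeping it responsible for semantic and contact reasoning while the segmentation and grasp models handle the pixel- and geometry-level work.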