Towards Open-World Grasping with Large Vision-Language Models

13 Oct 2024 | Georgios Tziafas, Hamidreza Kasaei
The paper "Towards Open-World Grasping with Large Vision-Language Models" addresses the challenge of grasping objects in open-ended environments from natural language instructions. The authors propose OWG (Open World Grasper), a system that combines large vision-language models (VLMs) with segmentation and grasp synthesis models to enable zero-shot grasping in cluttered indoor scenes. OWG is designed to handle high-level semantic reasoning and low-level physical-geometric reasoning, addressing the limitations of previous methods that rely on external vision and action models. Key contributions of the work include: 1. **Novel Algorithm**: A novel algorithm for grasping from open-ended language using VLMs. 2. **Extensive Evaluation**: Extensive comparisons and ablation studies in real cluttered indoor scenes, demonstrating the effectiveness of OWG's prompting strategies. 3. **Robot Experiments**: Experiments in both simulation and hardware, showing superior performance compared to previous zero-shot LLM-based methods. The paper also discusses related works on visual prompting for VLMs, the use of LLMs in robotics, and semantics-informed grasping. The method is decomposed into three stages: open-ended referring segmentation, grounded grasp planning, and grasp ranking via contact reasoning. The authors highlight the importance of visual prompting techniques and the role of VLMs in grounding, planning, and reasoning about the scene and object grasps. The results show that OWG outperforms existing methods in terms of grounding accuracy and grasp success rates, particularly in cluttered scenes. The paper concludes by discussing limitations and future directions, including the need for improvements in segmentation and grasp synthesis models, and the exploration of 6-DoF grasp detectors and more sophisticated prompting schemes.The paper "Towards Open-World Grasping with Large Vision-Language Models" addresses the challenge of grasping objects in open-ended environments from natural language instructions. The authors propose OWG (Open World Grasper), a system that combines large vision-language models (VLMs) with segmentation and grasp synthesis models to enable zero-shot grasping in cluttered indoor scenes. OWG is designed to handle high-level semantic reasoning and low-level physical-geometric reasoning, addressing the limitations of previous methods that rely on external vision and action models. Key contributions of the work include: 1. **Novel Algorithm**: A novel algorithm for grasping from open-ended language using VLMs. 2. **Extensive Evaluation**: Extensive comparisons and ablation studies in real cluttered indoor scenes, demonstrating the effectiveness of OWG's prompting strategies. 3. **Robot Experiments**: Experiments in both simulation and hardware, showing superior performance compared to previous zero-shot LLM-based methods. The paper also discusses related works on visual prompting for VLMs, the use of LLMs in robotics, and semantics-informed grasping. The method is decomposed into three stages: open-ended referring segmentation, grounded grasp planning, and grasp ranking via contact reasoning. The authors highlight the importance of visual prompting techniques and the role of VLMs in grounding, planning, and reasoning about the scene and object grasps. The results show that OWG outperforms existing methods in terms of grounding accuracy and grasp success rates, particularly in cluttered scenes. 
The paper concludes by discussing limitations and future directions, including the need for improvements in segmentation and grasp synthesis models, and the exploration of 6-DoF grasp detectors and more sophisticated prompting schemes.