This paper introduces a novel task called Reasoning Grasping, in which robots must generate grasp poses from implicit verbal instructions or intentions. The proposed model integrates a multi-modal Large Language Model (LLM) with a vision-based robotic grasping framework, enabling the robot to interpret complex, implicit instructions and accurately predict grasp poses for target objects or specific object parts within cluttered environments. The model marks the grasping target in its output with special tokens [SPT], from which the grasp poses are derived. It is trained on a new reasoning grasping benchmark dataset derived from GraspNet-1Billion, which includes implicit instructions for both object-level and part-level grasping. The dataset contains 64 objects, 109 parts, 1,730 reasoning instructions, and around 100 million grasp poses.
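To make the special-token mechanism concrete, here is a minimal PyTorch-style sketch of one plausible way such an output could be decoded into a grasp pose: the hidden state at the [SPT] position is passed through a small head that regresses the pose. This is not the authors' implementation; the class name `GraspDecoder`, the 7-D pose parametrization (translation, rotation, gripper width), and the hidden size are assumptions for illustration only.

```python
# Hedged sketch (not the paper's code): decoding a grasp pose from the hidden
# state of a special [SPT] token emitted by a multi-modal LLM. All names,
# dimensions, and the pose parametrization are illustrative assumptions.
import torch
import torch.nn as nn


class GraspDecoder(nn.Module):
    """Maps the LLM hidden state at the [SPT] position to a grasp pose."""

    def __init__(self, hidden_dim: int = 4096, pose_dim: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, 512),
            nn.ReLU(),
            nn.Linear(512, pose_dim),  # e.g. translation, rotation, gripper width
        )

    def forward(self, hidden_states: torch.Tensor, spt_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) from the LLM backbone
        # spt_mask:      (batch, seq_len) boolean mask marking [SPT] positions
        spt_hidden = hidden_states[spt_mask]   # (num_spt, hidden_dim)
        return self.mlp(spt_hidden)            # (num_spt, pose_dim)


# Usage: after the LLM generates e.g. "... grasp the handle [SPT]", locate the
# [SPT] token in the sequence and decode its hidden state into a grasp pose.
decoder = GraspDecoder()
hidden = torch.randn(1, 32, 4096)              # dummy LLM hidden states
mask = torch.zeros(1, 32, dtype=torch.bool)
mask[0, -1] = True                             # pretend [SPT] is the last token
grasp_pose = decoder(hidden, mask)             # shape (1, 7)
```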
The model is evaluated on the reasoning grasping dataset and in real-world experiments, and it outperforms existing baselines in both settings. It interprets implicit instructions and accurately generates the corresponding grasp poses, while retaining the visual reasoning capabilities of the original LLaVA model. In the real-world experiments, the model successfully grasps objects and parts from either explicit or implicit instructions. Performance is measured with three metrics: the correctness of the generated special tokens and grasping target names, the accuracy of the output grasp pose, and the success in lifting the object. The model outperforms the baseline in four distinct scenarios, demonstrating its superior ability in reasoning grasping tasks. The paper also discusses the model's limitations, including its performance on novel objects and the need for further research to improve its capabilities.
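As a concrete illustration of the second metric, grasp-pose accuracy, the sketch below scores a predicted 6-DoF grasp against a ground-truth grasp using translation and rotation tolerances. The representation (position plus rotation matrix) and the thresholds are placeholder assumptions, not the paper's stated criterion.

```python
# Hedged sketch (not the paper's evaluation code): one common way to decide
# whether a predicted grasp pose matches a ground-truth pose. The thresholds
# and pose representation are illustrative assumptions.
import numpy as np


def grasp_pose_correct(pred_t: np.ndarray, pred_R: np.ndarray,
                       gt_t: np.ndarray, gt_R: np.ndarray,
                       t_thresh: float = 0.05, rot_thresh_deg: float = 30.0) -> bool:
    """Return True if the prediction is within translation and rotation tolerance.

    pred_t, gt_t: (3,) gripper positions in meters.
    pred_R, gt_R: (3, 3) gripper rotation matrices.
    """
    # Translation error: Euclidean distance between gripper centers.
    t_err = np.linalg.norm(pred_t - gt_t)

    # Rotation error: geodesic angle between the two rotations.
    cos_angle = (np.trace(pred_R.T @ gt_R) - 1.0) / 2.0
    rot_err_deg = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

    return bool(t_err < t_thresh and rot_err_deg < rot_thresh_deg)


# Example: a prediction 2 cm and ~10 degrees away from the ground truth passes.
theta = np.radians(10)
pred_R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
print(grasp_pose_correct(np.array([0.0, 0.02, 0.0]), pred_R,
                         np.zeros(3), np.eye(3)))  # True
```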