The paper introduces a novel task called "Reasoning Grasping," in which robots must generate grasp poses from implicit human instructions. The authors propose an end-to-end model that integrates a multi-modal Large Language Model (LLM) with a vision-based robotic grasping framework, interpreting both images and instructions to identify the grasping target and produce accurate grasp poses. The paper also presents the first reasoning grasping benchmark dataset, derived from the GraspNet-1Billion dataset and augmented with implicit instructions and object-part grasping annotations. The model is trained with a combined objective of text generation loss and grasp prediction loss, and it outperforms existing baselines in both simulated and real-world experiments. The results demonstrate the model's ability to understand implicit instructions and generate precise grasp poses, marking a significant advance in robotic grasping capabilities.
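
To make the combined training objective concrete, here is a minimal sketch in a PyTorch-style setup. The function and parameter names (e.g., `reasoning_grasping_loss`, `lambda_grasp`), the choice of loss functions, and the tensor shapes are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

# Assumed components: a token-level cross-entropy loss for text generation and
# a regression loss for grasp-pose parameters. The weighting term is hypothetical.
cross_entropy = nn.CrossEntropyLoss(ignore_index=-100)  # text-generation loss
grasp_regression = nn.SmoothL1Loss()                    # grasp-prediction loss

def reasoning_grasping_loss(text_logits, text_targets,
                            grasp_pred, grasp_target,
                            lambda_grasp=1.0):
    """Combine language-modeling and grasp-pose losses into one objective.

    text_logits:  (batch, seq_len, vocab_size) predicted token logits
    text_targets: (batch, seq_len) ground-truth token ids (-100 = ignored)
    grasp_pred:   (batch, grasp_dim) predicted grasp parameters
    grasp_target: (batch, grasp_dim) ground-truth grasp parameters
    """
    # Flatten tokens for the cross-entropy text-generation loss.
    l_text = cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),
        text_targets.reshape(-1),
    )
    # Regression loss on grasp parameters (e.g., center, rotation, width).
    l_grasp = grasp_regression(grasp_pred, grasp_target)
    return l_text + lambda_grasp * l_grasp
```

The weight `lambda_grasp` simply balances the two terms; in practice it would be tuned so that neither the language-modeling signal nor the grasp-pose signal dominates training.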