4 Apr 2024 | Kailin Li, Jingbo Wang, Lixin Yang, Cewu Lu, Bo Dai
The paper introduces SEMGRASP, a novel method for generating semantic grasps from language instructions. SEMGRASP incorporates semantic information into the grasp representation by dividing it into three components: orientation, manner, and refinement. A discrete representation is used to align the grasp space with the semantic space, enabling the generation of grasp postures that align with linguistic intentions. The method leverages a Multimodal Large Language Model (MLLM) to integrate object, grasp, and language information within a unified semantic space. To train SEMGRASP, a large-scale dataset named CAPGRASP is compiled, featuring detailed captions and diverse grasps. Experimental results demonstrate that SEMGRASP effectively generates natural human grasps, showing its potential value in applications such as AR/VR and embodied robotics. The code, models, and dataset are publicly available.
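To make the discretization idea concrete, the sketch below shows one common way a continuous grasp feature can be quantized into tokens via nearest-neighbor codebook lookup, with a separate codebook per component mirroring the orientation/manner/refinement split. This is a minimal illustration under assumed names and dimensions, not the paper's actual implementation; the class `GraspTokenizer`, the feature size, and the three-codebook layout are all hypothetical.

```python
# Minimal sketch: vector-quantizing a grasp feature into discrete tokens.
# All names, dimensions, and the per-component codebook layout are
# illustrative assumptions, not SEMGRASP's published code.
import torch
import torch.nn as nn


class GraspTokenizer(nn.Module):
    """Quantizes a grasp feature with one codebook per grasp component."""

    def __init__(self, feat_dim: int = 64, codebook_size: int = 512):
        super().__init__()
        # One learnable codebook per assumed component of the grasp.
        self.codebooks = nn.ModuleDict({
            name: nn.Embedding(codebook_size, feat_dim)
            for name in ("orientation", "manner", "refinement")
        })

    def quantize(self, name: str, feat: torch.Tensor):
        """Nearest-neighbor lookup: returns (token_id, quantized_vector)."""
        codebook = self.codebooks[name].weight            # (K, D)
        dists = torch.cdist(feat.unsqueeze(0), codebook)  # (1, K)
        token = dists.argmin(dim=-1)                      # (1,)
        return token, self.codebooks[name](token).squeeze(0)


tokenizer = GraspTokenizer()
grasp_feat = torch.randn(64)  # stand-in for an encoded grasp posture
for part in ("orientation", "manner", "refinement"):
    token, _ = tokenizer.quantize(part, grasp_feat)
    print(part, "->", int(token))  # discrete IDs an MLLM could consume as tokens
```

The appeal of such a discrete representation is that each grasp reduces to a short sequence of token IDs, which can be appended to a language model's vocabulary and generated autoregressively alongside ordinary words, aligning the grasp space with the semantic space as the summary describes.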