13 Jun 2024 | An Dinh Vuong, Minh Nhat Vu, Baoru Huang, Nghia Nguyen, Hieu Le, Thieu Vo, Anh Nguyen
This paper introduces Grasp-Anything++, a large-scale language-driven grasp detection dataset containing 1M images and 10M grasp prompts. The authors propose a diffusion-based method for language-driven grasp detection that uses a contrastive training objective to improve the denoising process for grasp pose estimation. The dataset is created with foundation models and includes detailed object- and part-level annotations, enabling zero-shot grasp detection and serving as a benchmark for future work. The proposed method outperforms existing approaches on vision-based benchmarks, including zero-shot grasp detection, and in real-world robotic experiments, where it successfully grasps objects in cluttered environments. The paper also discusses limitations of the approach, including the lack of depth images for direct robotic applications and the reliance on the ChatGPT API for dataset creation. Overall, the work presents a promising new approach to language-driven grasp detection with significant implications for robotic applications.
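To make the core idea concrete, the sketch below shows one way a diffusion denoiser for grasp poses could be trained with an auxiliary contrastive alignment term between denoiser features and text embeddings. This is not the authors' implementation: the module design, the 5-D grasp-pose parameterization, the 512-D text embedding, the linear noise schedule, and the loss weighting are all assumptions made purely for illustration.

```python
# Hypothetical sketch (not the paper's code) of a language-conditioned diffusion
# training step for grasp detection: the denoiser predicts the noise added to a
# 5-D grasp pose (x, y, width, height, angle), conditioned on a text embedding,
# and an InfoNCE-style contrastive term aligns denoiser features with the
# matching prompt. Dimensions and weights are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraspDenoiser(nn.Module):
    def __init__(self, pose_dim=5, cond_dim=512, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pose_dim + cond_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.noise_head = nn.Linear(hidden, pose_dim)   # predicts the added noise
        self.feat_head = nn.Linear(hidden, cond_dim)    # features for the contrastive term

    def forward(self, noisy_pose, t, cond):
        h = self.net(torch.cat([noisy_pose, cond, t], dim=-1))
        return self.noise_head(h), self.feat_head(h)


def training_step(model, pose, text_emb, alphas_bar, tau=0.07):
    """One step: standard epsilon-prediction loss plus contrastive text alignment."""
    b = pose.size(0)
    t = torch.randint(0, alphas_bar.size(0), (b,), device=pose.device)
    a = alphas_bar[t].unsqueeze(-1)                                  # \bar{alpha}_t per sample
    noise = torch.randn_like(pose)
    noisy = a.sqrt() * pose + (1 - a).sqrt() * noise                 # forward process q(x_t | x_0)
    t_norm = t.float().unsqueeze(-1) / alphas_bar.size(0)
    eps_pred, feat = model(noisy, t_norm, text_emb)
    denoise_loss = F.mse_loss(eps_pred, noise)
    # Contrastive term: matched (feature, prompt) pairs on the diagonal are positives.
    logits = F.normalize(feat, dim=-1) @ F.normalize(text_emb, dim=-1).T / tau
    contrast_loss = F.cross_entropy(logits, torch.arange(b, device=pose.device))
    return denoise_loss + 0.1 * contrast_loss                        # weight is an assumption


# Example usage with random tensors standing in for real grasp labels and
# frozen text-encoder outputs.
num_steps = 1000
betas = torch.linspace(1e-4, 0.02, num_steps)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)
model = GraspDenoiser()
pose = torch.randn(8, 5)        # batch of normalized grasp poses
text_emb = torch.randn(8, 512)  # paired grasp-prompt embeddings
loss = training_step(model, pose, text_emb, alphas_bar)
loss.backward()
```

The contrastive term here is one plausible reading of "contrastive training objective": it encourages the denoiser's intermediate features to stay consistent with the conditioning prompt during denoising, rather than collapsing to prompt-agnostic grasps. The paper should be consulted for the exact formulation.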