SUGAR: Pre-training 3D Visual Representations for Robotics


1 Apr 2024 | Shizhe Chen, Ricardo Garcia, Ivan Laptev, Cordelia Schmid
SUGAR is a 3D pre-training framework for robotics that learns semantic, geometric, and affordance properties of objects from 3D point clouds. It addresses the limitations of existing 2D pre-training methods by focusing on cluttered scenes and multi-object understanding. SUGAR employs a transformer-based model trained jointly on five pre-training tasks: masked point modeling, cross-modal knowledge distillation, grasping pose synthesis, 3D instance segmentation, and referring expression grounding. The pre-training dataset of cluttered multi-object scenes is constructed automatically, with supervision generated in simulation.

SUGAR outperforms state-of-the-art 2D and 3D representations on three robotics-related tasks: zero-shot 3D object recognition, referring expression grounding, and language-driven robotic manipulation. These results demonstrate the importance of 3D pre-training on cluttered scenes and of learning object affordances for robotics. The contributions of this work are the introduction of SUGAR, joint pre-training on five tasks to learn object properties, and experimental validation of SUGAR's advantages on the three downstream tasks.
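To make the multi-task setup concrete, below is a minimal PyTorch sketch of how a single shared point-cloud transformer can drive five task-specific heads whose losses are summed into one joint objective. All names, head designs, and dimensions here (PointEncoder, SugarStyleModel, dim=384, the 7-dim grasp parameterization, etc.) are illustrative assumptions for exposition, not the authors' released architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointEncoder(nn.Module):
    """Shared trunk: splits the cloud into fixed-size point patches and
    encodes them with a plain transformer (a stand-in for the paper's
    point-cloud transformer)."""
    def __init__(self, num_groups=64, group_size=32, dim=384, depth=6):
        super().__init__()
        self.num_groups, self.group_size = num_groups, group_size
        self.embed = nn.Linear(group_size * 3, dim)  # one token per point patch
        layer = nn.TransformerEncoderLayer(dim, nhead=6, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, points):  # points: (B, num_groups * group_size, 3)
        b = points.shape[0]
        tokens = points.reshape(b, self.num_groups, self.group_size * 3)
        return self.transformer(self.embed(tokens))  # (B, num_groups, dim)

class SugarStyleModel(nn.Module):
    """One lightweight head per pre-training task on top of the shared encoder."""
    def __init__(self, dim=384, clip_dim=512, max_instances=8, group_size=32):
        super().__init__()
        self.encoder = PointEncoder(dim=dim, group_size=group_size)
        self.mpm_head = nn.Linear(dim, group_size * 3)  # masked point modeling
        self.distill_head = nn.Linear(dim, clip_dim)    # match frozen 2D (e.g. CLIP) features
        self.grasp_head = nn.Linear(dim, 3 + 4)         # grasp position + quaternion (assumed)
        self.seg_head = nn.Linear(dim, max_instances)   # per-token instance logits
        self.ground_head = nn.Linear(dim, clip_dim)     # scene feature vs. text embedding

    def forward(self, points):
        feats = self.encoder(points)
        return {
            "mpm": self.mpm_head(feats),
            "distill": self.distill_head(feats),
            "grasp": self.grasp_head(feats),
            "seg": self.seg_head(feats),
            "ground": self.ground_head(feats).mean(dim=1),  # pooled scene-level feature
        }

# Dummy training step with random targets, only to show the joint objective;
# real supervision would come from the simulation-generated dataset.
model = SugarStyleModel()
points = torch.randn(2, 64 * 32, 3)  # two clouds of 2048 points each
out = model(points)
loss = (
    F.mse_loss(out["mpm"], torch.randn_like(out["mpm"]))            # masked point modeling
    + F.mse_loss(out["distill"], torch.randn_like(out["distill"]))  # knowledge distillation
    + F.mse_loss(out["grasp"], torch.randn_like(out["grasp"]))      # grasp pose synthesis
    + F.cross_entropy(out["seg"].flatten(0, 1),                     # instance segmentation
                      torch.randint(0, 8, (2 * 64,)))
    + F.mse_loss(out["ground"], torch.randn(2, 512))                # referring grounding
)
loss.backward()
```

The point of the sketch is the design pattern: because all five heads read the same encoder output, gradients from semantics (distillation, grounding), geometry (masked point modeling, segmentation), and affordances (grasp synthesis) all shape one shared 3D representation.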