SUGAR 🍬: Pre-training 3D Visual Representations for Robotics

1 Apr 2024 | Shizhe Chen, Ricardo Garcia, Ivan Laptev, Cordelia Schmid
**Authors:** Shizhe Chen, Ricardo Garcia, Ivan Laptev, Cordelia Schmid

**Institutions:** Inria, École normale supérieure, CNRS, PSL Research University, Mohamed bin Zayed University of Artificial Intelligence

**Project Page:** https://cshizhe.github.io/projects/robot_sugar.html

**Abstract:** Learning generalizable visual representations from Internet data has shown promising results for robotics, but existing approaches primarily focus on 2D representations, which are sub-optimal for handling occlusions and for accurately localizing objects in complex 3D scenes. To address these limitations, we introduce SUGAR, a novel 3D pre-training framework for robotics. SUGAR captures the semantic, geometric, and affordance properties of objects through 3D point clouds. It emphasizes the importance of cluttered scenes for 3D representation learning and automatically constructs a multi-object dataset with cost-free supervision in simulation. SUGAR employs a versatile transformer-based model to jointly address five pre-training tasks: cross-modal knowledge distillation for semantic learning, masked point modeling for geometry understanding, grasping pose synthesis for object affordance, 3D instance segmentation, and referring expression grounding for analyzing cluttered scenes. Experiments on three robot-related tasks, namely zero-shot 3D object recognition, referring expression grounding, and language-driven robotic manipulation, show that SUGAR outperforms state-of-the-art 2D and 3D representations.

**Contributions:**
- We present SUGAR, a framework with a versatile transformer architecture for 3D point cloud representation learning in cluttered scenes.
- We pre-train SUGAR on five tasks to learn the semantics, geometry, and affordance of objects.
- We demonstrate that SUGAR outperforms state-of-the-art models on three robot-related tasks.

**Related Work:**
- Visual representation learning for robotics: approaches either rely on in-domain data or leverage large-scale Internet data for pre-training.
- 3D point cloud understanding: transformer-based models have gained popularity for processing 3D point clouds.
- Robotic manipulation: covers object grasping and language-guided manipulation, with recent work exploring 3D visual representations.

**Network Architecture:**
- SUGAR consists of a point cloud encoder and a prompt-based decoder, using transformer blocks to generate point embeddings and to obtain task-specific embeddings (see the sketch after the pre-training tasks below).

**Pre-training Tasks:**
- Masked point modeling (MPM) for geometry understanding.
- Cross-modal learning (CML) for distilling knowledge from image and text models.
- Grasping pose synthesis (GPS) for object affordance.
- Instance segmentation (INS) for segmenting 3D objects.
- Referring expression grounding (REG) for segmenting objects described by natural-language sentences.
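The summary above contains no code, so the following is a minimal PyTorch sketch of the encoder / prompt-based-decoder idea it describes: a transformer encoder produces per-group point embeddings, and a decoder attends from learnable task prompts (queries) to those embeddings to obtain task-specific embeddings. The class names (`PointGroupEmbed`, `PromptDecoderModel`), the simple contiguous grouping, and all dimensions and layer counts are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch (not the authors' code) of a point-cloud encoder with a
# prompt-based decoder, in the spirit of the architecture described above.
import torch
import torch.nn as nn

class PointGroupEmbed(nn.Module):
    """Embed groups of points (local offsets from their centroid) into tokens."""
    def __init__(self, group_size=32, dim=256):
        super().__init__()
        self.group_size = group_size
        self.mlp = nn.Sequential(nn.Linear(3, 128), nn.GELU(), nn.Linear(128, dim))

    def forward(self, points):  # points: (B, N, 3)
        B, N, _ = points.shape
        G = N // self.group_size
        groups = points[:, : G * self.group_size].reshape(B, G, self.group_size, 3)
        centers = groups.mean(dim=2)                                         # (B, G, 3)
        tokens = self.mlp(groups - centers.unsqueeze(2)).max(dim=2).values   # (B, G, D)
        return tokens, centers

class PromptDecoderModel(nn.Module):
    def __init__(self, dim=256, n_heads=8, enc_layers=6, dec_layers=4, n_prompts=8):
        super().__init__()
        self.embed = PointGroupEmbed(dim=dim)
        self.pos = nn.Linear(3, dim)  # positional encoding of group centroids
        enc_layer = nn.TransformerEncoderLayer(dim, n_heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, enc_layers)
        dec_layer = nn.TransformerDecoderLayer(dim, n_heads, 4 * dim, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, dec_layers)
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim))  # learnable task queries

    def forward(self, points):
        tokens, centers = self.embed(points)
        point_emb = self.encoder(tokens + self.pos(centers))          # per-group point embeddings
        queries = self.prompts.unsqueeze(0).expand(points.size(0), -1, -1)
        task_emb = self.decoder(queries, point_emb)                   # task-specific embeddings
        return point_emb, task_emb

# Usage: a batch of 2 point clouds with 1024 points each.
model = PromptDecoderModel()
point_emb, task_emb = model(torch.randn(2, 1024, 3))
print(point_emb.shape, task_emb.shape)  # torch.Size([2, 32, 256]) torch.Size([2, 8, 256])
```

In a SUGAR-style setup, the task-specific embeddings produced by the prompt queries would then feed per-task heads for the five pre-training objectives (MPM, CML, GPS, INS, REG) listed above.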
**Evaluation:**
- Zero-shot 3D object recognition (see the sketch below).
- Referring expression grounding.
- Language-driven robotic manipulation.

On all three robot-related tasks, SUGAR outperforms state-of-the-art 2D and 3D representations.
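As a hedged illustration of how cross-modal pre-training enables the zero-shot recognition evaluation above, the sketch below matches a pooled point-cloud feature against text embeddings of candidate class names by cosine similarity. The function name `zero_shot_classify` and the random tensors standing in for real point-cloud and text-encoder features are assumptions for illustration only.

```python
# Hedged sketch of zero-shot 3D object recognition: pick the class whose text
# embedding is most similar to the point-cloud embedding (cosine similarity).
import torch
import torch.nn.functional as F

def zero_shot_classify(point_feat, text_feats, class_names):
    """point_feat: (D,) pooled point-cloud feature; text_feats: (C, D) class-name features."""
    point_feat = F.normalize(point_feat, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    sims = text_feats @ point_feat          # cosine similarity per class
    return class_names[int(sims.argmax())]

# Toy usage with random features standing in for real encoder outputs.
classes = ["mug", "drill", "scissors"]
print(zero_shot_classify(torch.randn(256), torch.randn(3, 256), classes))
```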