SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation

2024 | Junjie Zhang, Chenjia Bai, Haoran He, Wenke Xia, Zhigang Wang, Bin Zhao, Xiu Li, Xuelong Li
SAM-E is a novel architecture for robot manipulation that leverages a visual foundation model for generalizable scene understanding and sequence imitation for long-horizon action reasoning. The method adopts the Segment Anything Model (SAM), pre-trained on a large-scale dataset of images and masks, as the foundation model for embodied manipulation. SAM's encoder extracts task-relevant visual features, and parameter-efficient fine-tuning on robot data adapts it to embodied scenarios. A multi-view transformer then integrates cross-view information with language instructions, enabling comprehensive fusion of the input modalities.

On top of the fused representation, a multi-channel policy head predicts heatmaps for an entire action sequence in a single forward pass, enabling efficient planning and execution. Experiments on a range of instruction-following tasks from the RLBench benchmark show that SAM-E outperforms state-of-the-art baselines in multi-task manipulation with higher execution efficiency, and significantly improves generalization in few-shot adaptation to new tasks. The method also transfers to real-world scenarios with a Franka Panda robot. These results highlight the effectiveness of combining a visual foundation model with action-sequence prediction for enhancing generalization and efficiency in 3D manipulation.
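To make the described pipeline concrete, below is a minimal PyTorch sketch of the three ingredients the summary mentions: a frozen image encoder adapted with low-rank adapters (one common form of parameter-efficient fine-tuning), a transformer that fuses per-view features with language tokens, and a multi-channel head that emits heatmaps for a whole action sequence in one pass. This is not the authors' released implementation; the class names (`LoRALinear`, `MultiViewPolicy`), the LoRA rank, the heatmap resolution, the horizon length, and the rotation/gripper output sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Low-rank adapter around a frozen linear layer (parameter-efficient fine-tuning).
    Rank and scaling are illustrative choices, not values from the paper."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep the pre-trained encoder weights frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # adapter starts as an identity update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


class MultiViewPolicy(nn.Module):
    """Hypothetical SAM-E-style policy: an image encoder per view (assumed here to
    return one pooled feature vector per image), a transformer fusing view and
    language tokens, and heads that predict a multi-step action sequence at once."""

    def __init__(self, encoder: nn.Module, d_model: int = 256, n_views: int = 4,
                 horizon: int = 8, heatmap_hw: tuple = (64, 64), rot_bins: int = 72):
        super().__init__()
        self.encoder = encoder               # e.g. a SAM ViT encoder with LoRA adapters
        self.fuse = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.horizon = horizon
        self.h, self.w = heatmap_hw
        # one translation heatmap channel per future step, per camera view
        self.heatmap_head = nn.Linear(d_model, horizon * self.h * self.w)
        # discretised rotation plus gripper open/close per future step (sizes assumed)
        self.rot_grip_head = nn.Linear(d_model, horizon * (3 * rot_bins + 2))

    def forward(self, views, lang_tokens):
        # views: (B, V, C, H, W); lang_tokens: (B, L, d_model)
        B, V = views.shape[:2]
        feats = self.encoder(views.flatten(0, 1))           # (B*V, d_model) assumed
        feats = feats.view(B, V, -1)
        tokens = torch.cat([feats, lang_tokens], dim=1)      # fuse views + language
        fused = self.fuse(tokens)
        view_tok, lang_tok = fused[:, :V], fused[:, V:]
        heatmaps = self.heatmap_head(view_tok).view(B, V, self.horizon, self.h, self.w)
        rot_grip = self.rot_grip_head(lang_tok.mean(dim=1)).view(B, self.horizon, -1)
        return heatmaps, rot_grip            # one forward pass -> whole action sequence
```

In this sketch, the per-step translation is read off as the argmax of each heatmap channel back-projected through the camera views, while the second head supplies rotation bins and the gripper state; predicting the full horizon at once is what gives the single-pass execution efficiency described above.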