RoboUniView: Visual-Language Model with Unified View Representation for Robotic Manipulation


12 Jul 2024 | Fanfan Liu, Feng Yan, Liming Zheng, Chengjian Feng, Yiyang Huang, Lin Ma
RoboUniView is a visual-language model for robotic manipulation that introduces a unified view representation to improve performance and generalization across robotic platforms. The model decouples visual feature extraction from action learning: it learns a unified view representation from multi-perspective views through pre-training on readily available data, so the representation reflects the physical world rather than the camera parameters of any particular platform. On the CALVIN benchmark, RoboUniView achieves state-of-the-art performance, raising success rates in both the D→D and ABC→D settings. It also maintains high performance under unseen camera parameters, transfers across multiple datasets, and supports joint cross-task learning across datasets.

The key contributions are: a visual-language model with a unified view representation for robotic manipulation; an effective pre-training method for obtaining that representation; and extensive experiments demonstrating superior performance. RoboUniView is first pre-trained on a large dataset of RGB-D images with camera parameters, and then fine-tuned on robot data to learn multi-task visual manipulation.
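As a rough illustration of the unified-view idea (not the authors' code), the sketch below lifts per-camera feature maps into a fixed 3D grid in the robot frame using known intrinsics and extrinsics. The function name, tensor shapes, and view-averaging scheme are assumptions made for illustration; because sampling is driven entirely by camera parameters, the resulting grid features are independent of where the cameras happen to be mounted.

```python
import torch
import torch.nn.functional as F

def sample_unified_grid(view_feats, intrinsics, extrinsics, grid_points):
    """Hedged sketch: lift multi-view image features into a shared 3D grid.

    view_feats : (V, C, H, W)  per-camera feature maps
    intrinsics : (V, 3, 3)     camera intrinsic matrices
    extrinsics : (V, 4, 4)     world-to-camera transforms
    grid_points: (N, 3)        3D grid centers in the robot/world frame
    Returns    : (C, N)        camera-independent grid features
    """
    V, C, H, W = view_feats.shape
    N = grid_points.shape[0]
    ones = torch.ones(N, 1, device=grid_points.device, dtype=grid_points.dtype)
    homog = torch.cat([grid_points, ones], dim=-1)                    # (N, 4)
    cam_pts = (extrinsics @ homog.T)[:, :3, :]                        # (V, 3, N)
    pix = intrinsics @ cam_pts                                        # (V, 3, N)
    uv = pix[:, :2] / pix[:, 2:3].clamp(min=1e-6)                     # (V, 2, N) pixel coords
    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    norm = torch.stack([uv[:, 0] / (W - 1), uv[:, 1] / (H - 1)], -1) * 2 - 1   # (V, N, 2)
    sampled = F.grid_sample(view_feats, norm.unsqueeze(1), align_corners=True) # (V, C, 1, N)
    # Keep only points that project in front of the camera and inside the image.
    valid = (pix[:, 2] > 0) & (norm.abs() <= 1).all(dim=-1)           # (V, N)
    sampled = sampled.squeeze(2) * valid.unsqueeze(1)                 # (V, C, N)
    return sampled.sum(0) / valid.sum(0).clamp(min=1).unsqueeze(0)    # average over visible views
```

This is only one plausible way to realize a camera-parameter-driven lifting step; the paper's UVFormer learns this aggregation with attention rather than simple averaging.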
The architecture consists of a Vision Encoder, a Feature Fusion Decoder, and a Policy Head. Within the Vision Encoder, a simplified module called UVFormer transforms multi-camera perspective-view features into the unified view representation. Training proceeds in two stages: pre-training on a 3D occupancy prediction task, followed by fine-tuning on multi-task grasping data. Evaluated on the CALVIN dataset, the model significantly improves success rates over existing methods, and its consistently stable unified view representation allows it to generalize to unseen camera parameters. The paper also discusses limitations and future work, including real-world deployment and further study of generalization.
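To make the Vision Encoder → UVFormer → Feature Fusion Decoder → Policy Head flow concrete, here is a minimal PyTorch sketch. The layer choices (a patchify convolution standing in for the encoder, a single cross-attention layer standing in for UVFormer, a 7-dimensional action output) are assumptions for illustration and do not reproduce the authors' implementation.

```python
import torch
import torch.nn as nn

class RoboUniViewSketch(nn.Module):
    """Minimal sketch of the described pipeline: Vision Encoder -> UVFormer ->
    Feature Fusion Decoder -> Policy Head. Sizes and layers are illustrative."""

    def __init__(self, feat_dim=256, num_grid_tokens=2048):
        super().__init__()
        # ViT-like patch embedding stand-in for the Vision Encoder.
        self.vision_encoder = nn.Conv2d(3, feat_dim, kernel_size=16, stride=16)
        # Learnable 3D grid queries that cross-attend to multi-view features.
        self.grid_queries = nn.Parameter(torch.randn(num_grid_tokens, feat_dim))
        self.uvformer = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        # Fuses language tokens with the unified view representation.
        self.fusion_decoder = nn.TransformerDecoderLayer(feat_dim, nhead=8, batch_first=True)
        # Predicts a pose delta plus gripper command (7 values assumed here).
        self.policy_head = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                                         nn.Linear(feat_dim, 7))

    def forward(self, views, text_tokens):
        # views: (B, V, 3, H, W) multi-camera RGB; text_tokens: (B, T, feat_dim)
        B, V = views.shape[:2]
        feats = self.vision_encoder(views.flatten(0, 1))     # (B*V, C, h, w)
        feats = feats.flatten(2).transpose(1, 2)             # (B*V, h*w, C)
        feats = feats.reshape(B, -1, feats.shape[-1])        # (B, V*h*w, C)
        q = self.grid_queries.unsqueeze(0).expand(B, -1, -1)
        unified, _ = self.uvformer(q, feats, feats)          # unified view representation
        fused = self.fusion_decoder(tgt=text_tokens, memory=unified)
        return self.policy_head(fused.mean(dim=1))           # (B, 7) action
```

For example, `RoboUniViewSketch()(torch.randn(1, 2, 3, 224, 224), torch.randn(1, 32, 256))` returns a (1, 7) action tensor. In the two-stage recipe described above, the encoder and UVFormer would first be pre-trained with a 3D occupancy head (supervised from RGB-D and camera parameters) before the fusion decoder and policy head are fine-tuned on robot demonstrations.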