RoboUniView: Visual-Language Model with Unified View Representation for Robotic Manipulation

12 Jul 2024 | Fanfan Liu, Feng Yan, Liming Zheng, Chengjian Feng, Yiyang Huang, Lin Ma
RoboUniView is a novel Visual-Language Model (VLM) for robotic manipulation, designed to generalize better to new objects and instructions. The paper targets the performance disparities that arise across robotic platforms from differences in camera specifications and mounting positions. RoboUniView decouples visual feature extraction from action learning: it first learns a unified view representation from multi-perspective views through pre-training on easily accessible data, and then derives actions from that representation to control the robot. This ensures the representation accurately reflects the physical world and is not constrained by the camera parameters of any particular platform.

On the CALVIN benchmark, RoboUniView achieves state-of-the-art performance, raising the success rate in the $D \rightarrow D$ setting from 93.0% to 96.2% and in the $ABC \rightarrow D$ setting from 92.2% to 94.2%. It also shows strong adaptability and flexibility: it maintains high performance under unseen camera parameters, can be trained on multiple datasets with varying camera parameters, and supports joint cross-task learning across datasets.

At the core of the method is UVFormer, a plugin inspired by BEVFormer that can be integrated into any multi-modal model to transform multi-perspective views into a unified view representation.
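The summary describes UVFormer only at a high level, so the following is a minimal sketch of what a UVFormer-like module could look like: a fixed 3D grid of learnable queries cross-attends to camera-aware multi-view image features, yielding a representation anchored to world-space grid cells rather than to any single camera. Plain multi-head cross-attention stands in for BEVFormer-style deformable attention here, and every name (UVFormerSketch, the grid size, the occupancy head) is an illustrative assumption rather than the paper's actual interface.

```python
# Minimal sketch of a UVFormer-style module: a fixed 3D grid of learnable
# queries cross-attends to multi-view image features to form a unified view
# representation that is independent of any single camera's parameters.
# Plain multi-head cross-attention stands in for BEVFormer-style deformable
# attention; all class/argument names are illustrative, not the paper's API.
import torch
import torch.nn as nn


class UVFormerSketch(nn.Module):
    def __init__(self, embed_dim=256, grid_size=(8, 8, 4), num_heads=8):
        super().__init__()
        self.grid_size = grid_size
        num_queries = grid_size[0] * grid_size[1] * grid_size[2]
        # One learnable query per cell of the unified 3D grid.
        self.grid_queries = nn.Parameter(torch.randn(num_queries, embed_dim) * 0.02)
        # Encode each view's camera extrinsics (flattened 4x4) and intrinsics (3x3)
        # so the attention can account for where each view looks.
        self.cam_embed = nn.Linear(16 + 9, embed_dim)
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, embed_dim * 4), nn.GELU(),
            nn.Linear(embed_dim * 4, embed_dim),
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        # Pre-training head: per-cell occupancy prediction.
        self.occ_head = nn.Linear(embed_dim, 1)

    def forward(self, view_feats, intrinsics, extrinsics):
        """
        view_feats: (B, V, N_tokens, C) image features from a shared 2D backbone
        intrinsics: (B, V, 3, 3), extrinsics: (B, V, 4, 4)
        Returns unified grid features (B, N_queries, C) and occupancy logits.
        """
        B, V, N, C = view_feats.shape
        cam = torch.cat([extrinsics.flatten(2), intrinsics.flatten(2)], dim=-1)  # (B, V, 25)
        cam = self.cam_embed(cam).unsqueeze(2)                                   # (B, V, 1, C)
        kv = (view_feats + cam).reshape(B, V * N, C)                             # camera-aware tokens
        q = self.grid_queries.unsqueeze(0).expand(B, -1, -1)                     # (B, N_q, C)
        q = self.norm1(q + self.cross_attn(q, kv, kv)[0])
        q = self.norm2(q + self.ffn(q))
        occ_logits = self.occ_head(q).squeeze(-1)                                # (B, N_q)
        return q, occ_logits


# Usage: two camera views, 196 feature tokens each.
model = UVFormerSketch()
feats = torch.randn(1, 2, 196, 256)
K = torch.eye(3).expand(1, 2, 3, 3)
E = torch.eye(4).expand(1, 2, 4, 4)
grid_repr, occ = model(feats, K, E)
print(grid_repr.shape, occ.shape)  # torch.Size([1, 256, 256]) torch.Size([1, 256])
```

In this reading, the occupancy logits would be supervised only during pre-training, while the grid features themselves would be what gets passed on to the VLM and action head for control.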
The unified view representation is pre-trained on a 3D occupancy task that requires only RGB-D images and no expensive manual annotations; how such labels can be derived from depth alone is sketched below. In the subsequent action learning phase, the model outputs robot actions directly from the unified view representation, building on pre-trained VLMs.

Experiments show that RoboUniView outperforms existing methods in both performance and generalization, with significant improvements in success rates and task sequence lengths, including zero-shot generalization to unseen camera parameters, training on multiple datasets with different camera parameters, and joint cross-task learning across datasets. Its main limitations are the dependence on precise camera calibration and the lack of real-robot data for deployment; future work aims to address both and to deploy RoboUniView on real-world robots.
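Because the paper emphasizes that the occupancy supervision comes from plain RGB-D frames without manual labeling, here is a minimal sketch, under assumed workspace bounds and grid resolution, of how such voxel labels could be built by back-projecting depth pixels into a world-frame grid. The function name and parameters are hypothetical, not taken from the paper.

```python
# Minimal sketch of building 3D occupancy labels from RGB-D alone:
# back-project each valid depth pixel into world coordinates with the camera
# intrinsics/extrinsics, then mark the voxel it falls into as occupied.
# Grid bounds and resolution are illustrative assumptions, not paper values.
import numpy as np


def occupancy_from_depth(depth, K, cam_to_world,
                         bounds=((-0.5, 0.5), (-0.5, 0.5), (0.0, 0.6)),
                         resolution=(32, 32, 16)):
    """depth: (H, W) metric depth; K: (3, 3) intrinsics; cam_to_world: (4, 4)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    valid = depth > 0
    z = depth[valid]
    # Pixel -> camera frame.
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=0)   # (4, N)
    pts_world = (cam_to_world @ pts_cam)[:3].T               # (N, 3)

    # Quantize world points into the voxel grid.
    occ = np.zeros(resolution, dtype=bool)
    lo = np.array([b[0] for b in bounds])
    hi = np.array([b[1] for b in bounds])
    idx = ((pts_world - lo) / (hi - lo) * np.array(resolution)).astype(int)
    inside = np.all((idx >= 0) & (idx < np.array(resolution)), axis=1)
    occ[tuple(idx[inside].T)] = True
    return occ


# Usage: a flat surface 0.4 m in front of a camera looking along +z.
depth = np.full((120, 160), 0.4, dtype=np.float32)
K = np.array([[100.0, 0, 80.0], [0, 100.0, 60.0], [0, 0, 1.0]])
occ = occupancy_from_depth(depth, K, np.eye(4))
print(occ.shape, occ.sum())  # (32, 32, 16) with a band of occupied voxels
```

Free space along each camera ray could additionally be marked as empty by ray casting, but the simple point back-projection above is enough to illustrate why no manual annotation is needed.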