19 Mar 2024 | Vidhi Jain, Maria Attarian, Nikhil J Joshi, Ayzaan Wahid, Danny Driess, Quan Vuong, Pannag R Sanketi, Pierre Sermanet, Stefan Welker, Christine Chan, Igor Gilitschenski, Yonatan Bisk, Debidatta Dwibedi
Vid2Robot is an end-to-end video-conditioned policy learning framework that produces robot actions directly from a video demonstration of a manipulation task and the robot's current visual observations. The model uses a unified representation trained on a large dataset of human videos and robot trajectories to recognize task semantics and generate appropriate actions. It relies on cross-attention to fuse prompt-video features with the robot's current state and produce actions that mimic the demonstrated task, and auxiliary contrastive losses are introduced to improve alignment between human and robot video representations.
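As a rough illustration of this fusion step, the sketch below shows how robot-state tokens could cross-attend to prompt-video tokens before an action is decoded. It is a minimal PyTorch sketch, assuming made-up module names, token counts, feature dimensions, and a 7-dimensional action output; it is not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of cross-attention fusion: robot-state
# tokens attend to prompt-video tokens, and the fused features are decoded into
# an action. All names, dimensions, and the action parameterization are assumed.
import torch
import torch.nn as nn


class StatePromptFusion(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8, action_dim: int = 7):
        super().__init__()
        # Robot-state tokens act as queries; prompt-video tokens act as keys/values.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.action_head = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, action_dim)
        )

    def forward(self, state_tokens: torch.Tensor, prompt_tokens: torch.Tensor) -> torch.Tensor:
        # state_tokens:  (B, S, dim) encoded current observations / robot state
        # prompt_tokens: (B, P, dim) encoded prompt-video frames
        fused, _ = self.cross_attn(query=state_tokens, key=prompt_tokens, value=prompt_tokens)
        fused = self.norm(state_tokens + fused)       # residual connection
        return self.action_head(fused.mean(dim=1))    # pooled features -> action vector


# Example usage with random features standing in for encoder outputs.
policy = StatePromptFusion()
state = torch.randn(2, 16, 512)    # 16 state/observation tokens per example
prompt = torch.randn(2, 64, 512)   # 64 prompt-video tokens per example
action = policy(state, prompt)     # (2, 7) predicted action vector
```

Using the state tokens as queries lets the current observation select the parts of the demonstration most relevant to the next action.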
Evaluated on real-world robots, Vid2Robot achieves a 20% improvement over other video-conditioned policies when conditioned on human demonstration videos. It also exhibits emergent capabilities such as cross-object motion transfer and long-horizon composition, demonstrating its potential for real-world applications. Training combines an action prediction loss with auxiliary temporal video alignment, prompt-robot video contrastive, and video-text contrastive losses. The architecture uses a transformer-based policy that encodes the video task specification and a state-prompt encoder that fuses state and prompt information for action prediction.
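A hedged sketch of how such a combined objective could look is given below. The symmetric InfoNCE form, the mean-squared-error action loss, and the loss weight are illustrative assumptions rather than the paper's exact formulation, and the temporal video alignment term is omitted for brevity.

```python
# Sketch of a combined training objective: supervised action prediction plus
# auxiliary contrastive alignment terms. Weights and loss forms are assumptions.
import torch
import torch.nn.functional as F


def info_nce(x: torch.Tensor, y: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of paired embeddings of shape (B, D)."""
    x = F.normalize(x, dim=-1)
    y = F.normalize(y, dim=-1)
    logits = x @ y.t() / temperature                   # (B, B) similarity matrix
    targets = torch.arange(x.size(0), device=x.device)  # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def total_loss(pred_actions, target_actions,
               prompt_emb, robot_emb, video_emb, text_emb,
               w_contrastive: float = 0.1) -> torch.Tensor:
    # Main objective: imitate the expert actions from the robot trajectory.
    action_loss = F.mse_loss(pred_actions, target_actions)
    # Auxiliary objectives: align prompt (human) and robot video embeddings,
    # and align video embeddings with their language descriptions.
    prompt_robot_loss = info_nce(prompt_emb, robot_emb)
    video_text_loss = info_nce(video_emb, text_emb)
    # A temporal video alignment term would be added here in the same fashion.
    return action_loss + w_contrastive * (prompt_robot_loss + video_text_loss)
```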
The model is tested on a range of real-world manipulation tasks, where it outperforms the BC-Z baseline on tasks such as placing objects upright, picking objects from drawers, and opening and closing drawers, with the largest gains on placing objects into drawers. It also demonstrates cross-object motion transfer, applying a motion demonstrated on one object to other objects. Evaluations span different settings, including varying lighting conditions, object configurations, and robot embodiments, and comparisons with other approaches in the literature show the model's effectiveness at learning from video demonstrations and adapting to new tasks. Success rates depend on factors such as the quality of the prompt video, the presence of distractors, and the robot's ability to estimate its state, and grasp success rates are further improved with multimodal sensor fusion. Overall, the results demonstrate the potential of video-conditioned policy learning for real-world robotic applications.