19 Mar 2024 | Vidhi Jain, Maria Attarian, Nikhil J Joshi, Ayzaan Wahid, Danny Driess, Quan Vuong, Pannag R Sanketi, Pierre Sermanet, Stefan Welker, Christine Chan, Igor Gilitschenski, Yonatan Bisk, Debidatta Dwibedi
Vid2Robot is an end-to-end video-conditioned policy learning framework that produces robot actions directly from a video demonstration of a manipulation task and the robot's current visual observations. The model uses a unified representation trained on a large dataset of human videos and robot trajectories to recognize task semantics and generate appropriate actions. It relies on cross-attention to fuse prompt-video features with the robot's current state and produce actions that mimic the demonstrated task, and auxiliary contrastive losses are introduced to improve alignment between human and robot video representations.
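As a rough illustration of this fusion step, the sketch below shows how robot-state tokens could cross-attend to prompt-video tokens before an action is decoded. It is a minimal PyTorch sketch, assuming made-up module names, token counts, feature dimensions, and a 7-dimensional action output; it is not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of cross-attention fusion: robot-state
# tokens attend to prompt-video tokens, and the fused features are decoded into
# an action. All names, dimensions, and the action parameterization are assumed.
import torch
import torch.nn as nn


class StatePromptFusion(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8, action_dim: int = 7):
        super().__init__()
        # Robot-state tokens act as queries; prompt-video tokens act as keys/values.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.action_head = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, action_dim)
        )

    def forward(self, state_tokens: torch.Tensor, prompt_tokens: torch.Tensor) -> torch.Tensor:
        # state_tokens:  (B, S, dim) encoded current observations / robot state
        # prompt_tokens: (B, P, dim) encoded prompt-video frames
        fused, _ = self.cross_attn(query=state_tokens, key=prompt_tokens, value=prompt_tokens)
        fused = self.norm(state_tokens + fused)       # residual connection
        return self.action_head(fused.mean(dim=1))    # pooled features -> action vector


# Example usage with random features standing in for encoder outputs.
policy = StatePromptFusion()
state = torch.randn(2, 16, 512)    # 16 state/observation tokens per example
prompt = torch.randn(2, 64, 512)   # 64 prompt-video tokens per example
action = policy(state, prompt)     # (2, 7) predicted action vector
```

Using the state tokens as queries lets the current observation select the parts of the demonstration most relevant to the next action.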
Evaluated on real-world robots, Vid2Robot achieves a 20% improvement over other video-conditioned policies when conditioned on human demonstration videos. It also exhibits emergent capabilities such as cross-object motion transfer and long-horizon composition, demonstrating its potential for real-world applications. Training combines an action prediction loss with auxiliary temporal video alignment, prompt-robot video contrastive, and video-text contrastive losses. The architecture uses a transformer-based policy that encodes the video task specification and a state-prompt encoder that fuses state and prompt information for action prediction.
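A hedged sketch of how such a combined objective could look is given below. The symmetric InfoNCE form, the mean-squared-error action loss, and the loss weight are illustrative assumptions rather than the paper's exact formulation, and the temporal video alignment term is omitted for brevity.

```python
# Sketch of a combined training objective: supervised action prediction plus
# auxiliary contrastive alignment terms. Weights and loss forms are assumptions.
import torch
import torch.nn.functional as F


def info_nce(x: torch.Tensor, y: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of paired embeddings of shape (B, D)."""
    x = F.normalize(x, dim=-1)
    y = F.normalize(y, dim=-1)
    logits = x @ y.t() / temperature                   # (B, B) similarity matrix
    targets = torch.arange(x.size(0), device=x.device)  # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def total_loss(pred_actions, target_actions,
               prompt_emb, robot_emb, video_emb, text_emb,
               w_contrastive: float = 0.1) -> torch.Tensor:
    # Main objective: imitate the expert actions from the robot trajectory.
    action_loss = F.mse_loss(pred_actions, target_actions)
    # Auxiliary objectives: align prompt (human) and robot video embeddings,
    # and align video embeddings with their language descriptions.
    prompt_robot_loss = info_nce(prompt_emb, robot_emb)
    video_text_loss = info_nce(video_emb, text_emb)
    # A temporal video alignment term would be added here in the same fashion.
    return action_loss + w_contrastive * (prompt_robot_loss + video_text_loss)
```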
The model is tested on a range of real-world manipulation tasks, where it outperforms the BC-Z baseline on tasks such as placing objects upright, picking objects from drawers, and opening and closing drawers, with the largest gains on placing objects into drawers. It also demonstrates cross-object motion transfer, applying a motion demonstrated on one object to other objects. Evaluations span different settings, including varying lighting conditions, object configurations, and robot embodiments, and comparisons with other approaches in the literature show the model's effectiveness at learning from video demonstrations and adapting to new tasks. Success rates depend on factors such as the quality of the prompt video, the presence of distractors, and the robot's ability to estimate its state, and grasp success rates are further improved with multimodal sensor fusion. Overall, the results demonstrate the potential of video-conditioned policy learning for real-world robotic applications.