RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

2023-8-1 | Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Lisa Lee, Tsang-Wei Edward Lee, Sergey Levine, Yao Lu, Henryk Michalewski, Igor Mordatch, Karl Pertsch, Kanishka Rao, Krista Reymann, Michael Ryoo, Grecia Salazar, Pannag Sanketi, Pierre Sermanet, Jaspiar Singh, Anikait Singh, Radu Soricut, Huong Tran, Vincent Vanhoucke, Quan Vuong, Ayzaan Wahid, Stefan Welker, Paul Wohlhart, Jialin Wu, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Tianhe Yu, and Brianna Zitkovich
The paper "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control" by Anthony Brohan et al. explores the integration of vision-language models into robotic control to enhance generalization and enable semantic reasoning. The authors propose a method to co-fine-tune state-of-the-art vision-language models on both robotic trajectory data and Internet-scale vision-language tasks, such as visual question answering. By expressing robot actions as text tokens, they train the models to output low-level robot actions alongside natural language responses. This approach, referred to as vision-language-action (VLA) models, is demonstrated with an example model called RT-2. Extensive evaluations show that RT-2 achieves significant improvements in generalization to novel objects, scenes, and instructions, and exhibits emergent capabilities such as interpreting commands not present in the training data and performing multi-stage semantic reasoning. The paper also discusses related work, limitations, and future directions, highlighting the potential of VLA models in advancing robotics.The paper "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control" by Anthony Brohan et al. explores the integration of vision-language models into robotic control to enhance generalization and enable semantic reasoning. The authors propose a method to co-fine-tune state-of-the-art vision-language models on both robotic trajectory data and Internet-scale vision-language tasks, such as visual question answering. By expressing robot actions as text tokens, they train the models to output low-level robot actions alongside natural language responses. This approach, referred to as vision-language-action (VLA) models, is demonstrated with an example model called RT-2. Extensive evaluations show that RT-2 achieves significant improvements in generalization to novel objects, scenes, and instructions, and exhibits emergent capabilities such as interpreting commands not present in the training data and performing multi-stage semantic reasoning. The paper also discusses related work, limitations, and future directions, highlighting the potential of VLA models in advancing robotics.