2023-8-1 | Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Lisa Lee, Tsang-Wei Edward Lee, Sergey Levine, Yao Lu, Henryk Michalewski, Igor Mordatch, Karl Pertsch, Kanishka Rao, Krista Reymann, Michael Ryoo, Grecia Salazar, Pannag Sanketi, Pierre Sermanet, Jaspiar Singh, Anikait Singh, Radu Soricut, Huong Tran, Vincent Vanhoucke, Quan Vuong, Ayzaan Wahid, Stefan Welker, Paul Wohlhart, Jialin Wu, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Tianhe Yu, and Brianna Zitkovich
This paper presents RT-2, a vision-language-action (VLA) model that integrates large-scale vision-language models (VLMs) with robotic control to improve generalization and enable emergent semantic reasoning. The model is co-trained on robotic trajectory data and Internet-scale vision-language tasks, such as visual question answering. By expressing robot actions as text tokens and incorporating them into the training data alongside natural-language tokens, the model learns to map observations and instructions directly to actions. This approach lets the policy benefit from large-scale pretraining on language and vision-language data, leading to improved performance on novel object recognition, command interpretation, and reasoning tasks. RT-2 is evaluated over 6,000 trials and demonstrates significant improvements in generalization and emergent capabilities, including the ability to perform multi-stage semantic reasoning. The paper also discusses the limitations of the approach, such as the need for more diverse robot data to acquire new skills and the computational cost of large VLA models. The results show that RT-2 outperforms existing baselines in generalization and emergent capabilities, highlighting the potential of VLA models for robotics.
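To make the "actions as text tokens" idea concrete, here is a minimal sketch of discretizing a continuous robot action into a string of integer tokens and decoding it back. The bin count, action-dimension layout, and function names are illustrative assumptions for this note, not RT-2's exact implementation.

```python
import numpy as np

NUM_BINS = 256  # assumed uniform discretization per action dimension

def action_to_tokens(action, low, high, num_bins=NUM_BINS):
    """Map a continuous action vector to a space-separated string of bin indices."""
    action = np.clip(action, low, high)
    # Normalize each dimension to [0, 1], then quantize into num_bins bins.
    normalized = (action - low) / (high - low)
    bins = np.minimum((normalized * num_bins).astype(int), num_bins - 1)
    return " ".join(str(b) for b in bins)

def tokens_to_action(token_str, low, high, num_bins=NUM_BINS):
    """Invert the mapping: decode bin indices back to approximate continuous values."""
    bins = np.array([int(t) for t in token_str.split()], dtype=np.float32)
    # Use bin centers so decoding stays within half a bin of the original value.
    normalized = (bins + 0.5) / num_bins
    return low + normalized * (high - low)

if __name__ == "__main__":
    # Hypothetical 7-DoF action: xyz translation, rpy rotation, gripper opening.
    low = np.array([-0.1] * 6 + [0.0], dtype=np.float32)
    high = np.array([0.1] * 6 + [1.0], dtype=np.float32)
    action = np.array([0.02, -0.05, 0.0, 0.01, 0.0, -0.03, 1.0], dtype=np.float32)

    token_str = action_to_tokens(action, low, high)
    print(token_str)                           # space-separated bin indices, one per dimension
    print(tokens_to_action(token_str, low, high))  # approximate reconstruction of the action
```

Strings like this can be treated as ordinary output text by a VLM, which is what lets the same model be co-trained on vision-language data and robot trajectories.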