RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

2023-8-1 | Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Lisa Lee, Tsang-Wei Edward Lee, Sergey Levine, Yao Lu, Henryk Michalewski, Igor Mordatch, Karl Pertsch, Kanishka Rao, Krista Reymann, Michael Ryoo, Grecia Salazar, Pannag Sanketi, Pierre Sermanet, Jaspiar Singh, Anikait Singh, Radu Soricut, Huong Tran, Vincent Vanhoucke, Quan Vuong, Ayzaan Wahid, Stefan Welker, Paul Wohlhart, Jialin Wu, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Tianhe Yu, and Brianna Zitkovich
The paper "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control" by Anthony Brohan et al. explores the integration of vision-language models into robotic control to enhance generalization and enable semantic reasoning. The authors propose a method to co-fine-tune state-of-the-art vision-language models on both robotic trajectory data and Internet-scale vision-language tasks, such as visual question answering. By expressing robot actions as text tokens, they train the models to output low-level robot actions alongside natural language responses. This approach, referred to as vision-language-action (VLA) models, is demonstrated with an example model called RT-2. Extensive evaluations show that RT-2 achieves significant improvements in generalization to novel objects, scenes, and instructions, and exhibits emergent capabilities such as interpreting commands not present in the training data and performing multi-stage semantic reasoning. The paper also discusses related work, limitations, and future directions, highlighting the potential of VLA models in advancing robotics.The paper "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control" by Anthony Brohan et al. explores the integration of vision-language models into robotic control to enhance generalization and enable semantic reasoning. The authors propose a method to co-fine-tune state-of-the-art vision-language models on both robotic trajectory data and Internet-scale vision-language tasks, such as visual question answering. By expressing robot actions as text tokens, they train the models to output low-level robot actions alongside natural language responses. This approach, referred to as vision-language-action (VLA) models, is demonstrated with an example model called RT-2. Extensive evaluations show that RT-2 achieves significant improvements in generalization to novel objects, scenes, and instructions, and exhibits emergent capabilities such as interpreting commands not present in the training data and performing multi-stage semantic reasoning. The paper also discusses related work, limitations, and future directions, highlighting the potential of VLA models in advancing robotics.