OpenVLA: An Open-Source Vision-Language-Action Model


5 Sep 2024 | Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, Chelsea Finn
OpenVLA is an open-source vision-language-action (VLA) model with 7 billion parameters, trained on 970,000 robot manipulation demonstrations from the Open X-Embodiment dataset. It sets a new state of the art for generalist robot manipulation policies, outperforming closed models such as RT-2-X by 16.5% in absolute task success rate across 29 tasks. OpenVLA controls multiple robots out of the box and can be quickly adapted to new domains via parameter-efficient fine-tuning. The model is fully open source, with checkpoints and training code available on HuggingFace.

Architecturally, OpenVLA combines a visual encoder that fuses features from DINOv2 and SigLIP, a projector that maps the fused visual features into the language model's embedding space, and a Llama 2 language model backbone. The model demonstrates strong results for generalist manipulation, with especially strong generalization in multi-task environments involving multiple objects and strong language grounding. It also substantially outperforms from-scratch imitation learning methods such as Diffusion Policy in multi-task settings with multiple objects.
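To make the fused-encoder design concrete, below is a minimal sketch of how patch features from two vision backbones can be concatenated per patch and projected into the language model's embedding space. The class name, feature dimensions, and MLP shape are illustrative assumptions, not OpenVLA's exact implementation.

```python
# Minimal sketch of a fused DINOv2 + SigLIP projector (illustrative shapes only).
import torch
import torch.nn as nn

class FusedVisionProjector(nn.Module):
    def __init__(self, dino_dim=1024, siglip_dim=1152, llm_dim=4096):
        super().__init__()
        # Small MLP that maps fused patch features to LLM token embeddings.
        self.projector = nn.Sequential(
            nn.Linear(dino_dim + siglip_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, dino_patches, siglip_patches):
        # dino_patches:   (batch, num_patches, dino_dim)
        # siglip_patches: (batch, num_patches, siglip_dim)
        fused = torch.cat([dino_patches, siglip_patches], dim=-1)
        return self.projector(fused)  # (batch, num_patches, llm_dim)

# Usage with dummy tensors standing in for DINOv2 / SigLIP patch features:
proj = FusedVisionProjector()
visual_tokens = proj(torch.randn(1, 256, 1024), torch.randn(1, 256, 1152))
```

The projected visual tokens are then consumed by the language model alongside the tokenized instruction.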
Training discretizes continuous robot actions into discrete tokens and teaches the model to predict these action tokens from input images and language instructions. OpenVLA is evaluated on multiple robot platforms, including the WidowX and Google robot, where it outperforms prior generalist policies such as RT-1-X, RT-2-X, and Octo on most tasks and achieves the highest average performance. It also performs strongly on language-conditioned tasks and can be adapted to new robot setups through parameter-efficient fine-tuning.

For memory-efficient inference, OpenVLA can be quantized and run on consumer-grade GPUs without a significant loss in performance. The model, code, and training resources are released as open source, providing a foundation for further research on vision-language-action models in robotics and a path toward more generalist, adaptable robot policies.
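The action discretization used during training can be sketched as a simple per-dimension binning scheme, as below. The choice of 256 bins and fixed normalization bounds here are assumptions for illustration; the exact binning and de-normalization per robot are defined by the released code.

```python
# Sketch of per-dimension action discretization into bin indices (token ids).
import numpy as np

def discretize_action(action, low, high, n_bins=256):
    # action, low, high: arrays of shape (action_dim,)
    clipped = np.clip(action, low, high)
    norm = (clipped - low) / (high - low)            # map each dimension to [0, 1]
    bins = np.minimum((norm * n_bins).astype(int), n_bins - 1)
    return bins                                      # one discrete token per dimension

def undiscretize_action(bins, low, high, n_bins=256):
    # Map bin centers back to continuous values for execution on the robot.
    centers = (bins + 0.5) / n_bins
    return low + centers * (high - low)

# Example: a 7-DoF end-effector action (6 pose deltas + gripper command).
low, high = -np.ones(7), np.ones(7)
tokens = discretize_action(np.array([0.1, -0.2, 0.05, 0.0, 0.0, 0.3, 1.0]), low, high)
```

At inference time the predicted tokens are mapped back to continuous commands with the inverse of the same binning.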
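For the quantized-inference setting, loading the released checkpoint in 4-bit precision through Hugging Face transformers and bitsandbytes might look roughly like the following sketch; the repository id and loading options are assumptions to be checked against the project's own documentation.

```python
# Sketch: memory-efficient 4-bit loading of a released checkpoint (repo id assumed).
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
```

Parameter-efficient fine-tuning to a new robot setup would similarly wrap the loaded model with LoRA adapters (for example via the peft library) rather than updating all 7 billion parameters.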