OpenVLA: An Open-Source Vision-Language-Action Model


5 Sep 2024 | Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, Chelsea Finn
OpenVLA is an open-source vision-language-action (VLA) model with 7 billion parameters, trained on 970,000 robot manipulation demonstrations from the Open X-Embodiment dataset. It sets a new state of the art for generalist robot manipulation policies, outperforming closed models such as RT-2-X by 16.5% in absolute task success rate across 29 tasks. OpenVLA controls multiple robots out of the box and can be quickly adapted to new domains via parameter-efficient fine-tuning. The model is fully open source, with checkpoints and training code available on HuggingFace.

Architecturally, OpenVLA combines a visual encoder that fuses features from DINOv2 and SigLIP, a projector that maps the fused visual features into the language model's embedding space, and a Llama 2 language model backbone. The model demonstrates strong results for generalist manipulation, with especially strong generalization in multi-task environments involving multiple objects and strong language grounding. It also substantially outperforms from-scratch imitation learning methods such as Diffusion Policy in multi-task settings with multiple objects.
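To make the fused-encoder design concrete, below is a minimal sketch of how patch features from two vision backbones can be concatenated per patch and projected into the language model's embedding space. The class name, feature dimensions, and MLP shape are illustrative assumptions, not OpenVLA's exact implementation.

```python
# Minimal sketch of a fused DINOv2 + SigLIP projector (illustrative shapes only).
import torch
import torch.nn as nn

class FusedVisionProjector(nn.Module):
    def __init__(self, dino_dim=1024, siglip_dim=1152, llm_dim=4096):
        super().__init__()
        # Small MLP that maps fused patch features to LLM token embeddings.
        self.projector = nn.Sequential(
            nn.Linear(dino_dim + siglip_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, dino_patches, siglip_patches):
        # dino_patches:   (batch, num_patches, dino_dim)
        # siglip_patches: (batch, num_patches, siglip_dim)
        fused = torch.cat([dino_patches, siglip_patches], dim=-1)
        return self.projector(fused)  # (batch, num_patches, llm_dim)

# Usage with dummy tensors standing in for DINOv2 / SigLIP patch features:
proj = FusedVisionProjector()
visual_tokens = proj(torch.randn(1, 256, 1024), torch.randn(1, 256, 1152))
```

The projected visual tokens are then consumed by the language model alongside the tokenized instruction.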
Training discretizes continuous robot actions into discrete tokens and teaches the model to predict these action tokens from input images and language instructions. OpenVLA is evaluated on multiple robot platforms, including the WidowX and Google robot, where it outperforms prior generalist policies such as RT-1-X, RT-2-X, and Octo on most tasks and achieves the highest average performance. It also performs strongly on language-conditioned tasks and can be adapted to new robot setups through parameter-efficient fine-tuning.

For memory-efficient inference, OpenVLA can be quantized and run on consumer-grade GPUs without a significant loss in performance. The model, code, and training resources are released as open source, providing a foundation for further research on vision-language-action models in robotics and a path toward more generalist, adaptable robot policies.
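The action discretization used during training can be sketched as a simple per-dimension binning scheme, as below. The choice of 256 bins and fixed normalization bounds here are assumptions for illustration; the exact binning and de-normalization per robot are defined by the released code.

```python
# Sketch of per-dimension action discretization into bin indices (token ids).
import numpy as np

def discretize_action(action, low, high, n_bins=256):
    # action, low, high: arrays of shape (action_dim,)
    clipped = np.clip(action, low, high)
    norm = (clipped - low) / (high - low)            # map each dimension to [0, 1]
    bins = np.minimum((norm * n_bins).astype(int), n_bins - 1)
    return bins                                      # one discrete token per dimension

def undiscretize_action(bins, low, high, n_bins=256):
    # Map bin centers back to continuous values for execution on the robot.
    centers = (bins + 0.5) / n_bins
    return low + centers * (high - low)

# Example: a 7-DoF end-effector action (6 pose deltas + gripper command).
low, high = -np.ones(7), np.ones(7)
tokens = discretize_action(np.array([0.1, -0.2, 0.05, 0.0, 0.0, 0.3, 1.0]), low, high)
```

At inference time the predicted tokens are mapped back to continuous commands with the inverse of the same binning.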
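For the quantized-inference setting, loading the released checkpoint in 4-bit precision through Hugging Face transformers and bitsandbytes might look roughly like the following sketch; the repository id and loading options are assumptions to be checked against the project's own documentation.

```python
# Sketch: memory-efficient 4-bit loading of a released checkpoint (repo id assumed).
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
```

Parameter-efficient fine-tuning to a new robot setup would similarly wrap the loaded model with LoRA adapters (for example via the peft library) rather than updating all 7 billion parameters.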