26 May 2024 | Dibya Ghosh*, Homer Walke*, Karl Pertsch*, Kevin Black*, Oier Mees*, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Ria Doshi, Charles Xu, Jianlan Luo, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, Sergey Levine
Octo is an open-source generalist robot policy: a transformer pretrained on 800k diverse robot demonstrations from the Open X-Embodiment dataset, the largest robot manipulation dataset to date. The model supports flexible task and observation definitions, including natural language instructions, goal images, observation histories, and multi-modal, chunked action prediction via diffusion decoding, and it provides a generic, scalable recipe that accommodates a variety of commonly used robots, sensor configurations, and action spaces. Crucially, Octo can be fine-tuned to new robot setups within a few hours on standard consumer GPUs, including robots with different action spaces and different combinations of cameras and proprioceptive information. It is the first generalist robot policy (GRP) that can be effectively fine-tuned to new observation and action spaces, and the first that is fully open-source, including the training pipeline, model checkpoints, and data. In evaluations across 9 real robot setups at 4 institutions, Octo controls multiple robot embodiments, solves language- and goal-conditioned tasks out of the box, and serves as a versatile, data-efficient initialization for fine-tuning to new environments, tasks, observations, action spaces, and embodiments.
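The pluggable input/output design described above can be sketched roughly as follows. This is an illustrative toy, not Octo's actual API: the class and function names are assumptions, and a simple mean over tokens stands in for the transformer trunk. The key idea it shows is that new sensors only require a new tokenizer, and a new action space only requires swapping the action head, while the pretrained trunk is reused.

```python
# Illustrative sketch (NOT Octo's real API): a policy whose observation
# tokenizers and action head can be swapped out at fine-tuning time.
from dataclasses import dataclass, field
from typing import Callable, Dict

import numpy as np


@dataclass
class GeneralistPolicy:
    # Each tokenizer maps a raw input (image, proprioception, ...) to a
    # (num_tokens, embed_dim) array of tokens.
    tokenizers: Dict[str, Callable] = field(default_factory=dict)
    # The action head maps the trunk's readout embedding to actions.
    action_head: Callable = None

    def add_tokenizer(self, name: str, fn: Callable) -> None:
        # A new sensory input only needs a new tokenizer; the
        # pretrained trunk is reused unchanged.
        self.tokenizers[name] = fn

    def forward(self, inputs: Dict[str, np.ndarray]) -> np.ndarray:
        tokens = [self.tokenizers[k](v) for k, v in inputs.items()]
        # Stand-in for the transformer trunk + readout token.
        readout = np.concatenate(tokens).mean(axis=0)
        return self.action_head(readout)
```

Fine-tuning to a robot with a different action space then amounts to replacing `action_head` (and training it) while keeping the trunk weights.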
Octo's design draws on recent advances in robot imitation learning and scalable transformer training: a denoising diffusion objective for action decoding, prediction of "action chunks" (short sequences of future actions), and model layouts and learning rate schedules adapted from the literature on scalable vision transformer training. The model is trained on a mixture of 25 datasets from the Open X-Embodiment Dataset, a diverse collection of robot learning datasets spanning many tasks, embodiments, and scenes; these datasets are heterogeneous not only in robot type but also in sensors and labels. Actions are predicted by a conditional diffusion decoding head that models continuous, multi-modal action distributions, and this diffusion objective outperforms policies trained with MSE action heads or discretized action distributions. Evaluated on tasks including picking and placing, wiping a table with a cloth, and opening and closing drawers, Octo outperforms other generalist robot policies in success rate and in performance on new environments and tasks.
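The diffusion-based, chunked action decoding can be sketched as a minimal DDPM-style example. This is a hedged illustration, not Octo's actual implementation: the noise schedule, step count, and the `eps_net` interface are assumptions, and in practice `eps_net` would be a network conditioned on the transformer's readout embedding.

```python
# Minimal DDPM-style sketch of diffusion action decoding (illustrative
# only): the head is trained to predict the noise added to a ground-truth
# action chunk, and sampling runs the reverse denoising process.
import numpy as np

rng = np.random.default_rng(0)
T = 20                                 # number of diffusion steps (assumed)
betas = np.linspace(1e-4, 0.02, T)     # linear noise schedule (assumed)
alphas_bar = np.cumprod(1.0 - betas)   # cumulative product \bar{alpha}_t


def diffusion_loss(eps_net, obs_emb, action_chunk):
    """Training objective: MSE between true and predicted noise at a random step."""
    t = int(rng.integers(T))
    eps = rng.standard_normal(action_chunk.shape)
    noised = (np.sqrt(alphas_bar[t]) * action_chunk
              + np.sqrt(1.0 - alphas_bar[t]) * eps)
    return float(np.mean((eps_net(noised, t, obs_emb) - eps) ** 2))


def sample_chunk(eps_net, obs_emb, chunk_shape):
    """Inference: start from Gaussian noise and iteratively denoise
    into a chunk of continuous actions."""
    a = rng.standard_normal(chunk_shape)
    for t in reversed(range(T)):
        alpha_t = 1.0 - betas[t]
        eps_hat = eps_net(a, t, obs_emb)
        a = (a - betas[t] / np.sqrt(1.0 - alphas_bar[t]) * eps_hat) / np.sqrt(alpha_t)
        if t > 0:  # no noise is added at the final step
            a += np.sqrt(betas[t]) * rng.standard_normal(chunk_shape)
    return a
```

The sampled array has shape `(chunk_length, action_dim)`, i.e. an "action chunk" of several future actions, which is why a single forward pass can commit to a short multi-step plan rather than one action at a time.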
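Training on a mixture of 25 heterogeneous datasets is typically handled with per-dataset sampling weights, so that large datasets do not drown out small ones. The sketch below illustrates the idea only: the dataset names and weights are invented for illustration and are not the paper's actual mixture.

```python
# Hedged sketch of weighted mixture sampling across datasets
# (names and weights below are hypothetical, not the real mixture).
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-dataset mixture weights.
mixture = {"dataset_a": 0.5, "dataset_b": 0.3, "dataset_c": 0.2}
names = list(mixture)
probs = np.array([mixture[n] for n in names])
probs = probs / probs.sum()  # normalize defensively


def sample_batch(batch_size: int):
    """Each batch element is drawn from a dataset chosen by its weight."""
    return rng.choice(names, size=batch_size, p=probs)
```

Over many batches, each dataset's share of training examples converges to its mixture weight regardless of its raw size on disk.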