2 Jul 2024 | Michael Ahn, Debidatta Dwibedi, Chelsea Finn, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Karol Hausman, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Sean Kirmani, Isabel Leal, Edward Lee, Sergey Levine, Yao Lu, Sharath Maddineni, Kanishka Rao, Dorsa Sadigh, Pannag Sanketi, Pierre Sermanet, Quan Vuong, Stefan Welker, Fei Xia, Ted Xiao, Peng Xu, Steve Xu, Zhuo Xu
AutoRT is a system that leverages existing foundation models to scale the deployment of operational robots into completely unseen scenarios with minimal human supervision. It uses vision-language models (VLMs) for scene understanding and grounding, and large language models (LLMs) to propose diverse, novel instructions for a fleet of robots, enabling effective reasoning about autonomy tradeoffs and safety while significantly scaling up data collection for robot learning.

Evaluated over 7 months, across 4 different buildings, with a fleet of more than 20 robots, AutoRT collected 77,000 real-world robot episodes through both teleoperation and autonomous execution, demonstrating that it can generate diverse, real-world robot data on new skills in new environments. It is the first system in which LLM-controlled robots are allowed to drive autonomously in real-world settings, propose their own goals, and take actions toward those goals. A single human can supervise 3-5 mobile manipulators, and the collected data is highly diverse, can be steered toward task-appropriate collection, and can be used to improve state-of-the-art robot learning models. AutoRT also introduces a method for aligning robot behavior with human preferences through prompting and critiquing with a robot constitution.

AutoRT is designed to handle any state observed by a robot and to generate tasks executable by one of k different collect policies. The robot constitution defines foundational rules, safety constraints, and the robot's embodiment, and ablations of the system design demonstrate its usefulness. AutoRT is a step toward scaling robot data collection to the breadth of foundation models and embodying foundation models in robotic systems.
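The collection loop described above (VLM scene grounding, LLM task proposal, constitution-based filtering, and routing to one of k collect policies) might be sketched roughly as follows. All names here (`describe_scene`, `propose_tasks`, `passes_constitution`, `COLLECT_POLICIES`) are illustrative placeholders, not the paper's actual API, and the VLM/LLM calls are replaced by stubs:

```python
# Hypothetical sketch of an AutoRT-style collection step.
# VLM and LLM calls are replaced with placeholder stubs.
import random

# k different collect policies a generated task can be routed to
# (e.g., a scripted policy, a learned policy, or human teleoperation).
COLLECT_POLICIES = {
    "scripted": lambda task: f"scripted execution of: {task}",
    "learned": lambda task: f"learned-policy execution of: {task}",
    "teleop": lambda task: f"human teleoperation of: {task}",
}

def describe_scene(image):
    """Stand-in for a VLM call that grounds the scene into object names."""
    return ["apple", "sponge", "cup"]  # placeholder output

def propose_tasks(objects, n=5):
    """Stand-in for an LLM call that proposes diverse manipulation tasks."""
    return [f"pick up the {obj}" for obj in objects][:n]

def passes_constitution(task):
    """Stand-in for LLM self-critique against a robot constitution:
    foundational rules, safety constraints, and embodiment limits."""
    forbidden = ("human", "knife", "outlet")
    return not any(word in task for word in forbidden)

def collect_step(image):
    """One iteration: perceive, propose, filter, then route to a policy."""
    objects = describe_scene(image)
    tasks = propose_tasks(objects)
    safe_tasks = [t for t in tasks if passes_constitution(t)]
    if not safe_tasks:
        return None  # nothing safe to attempt in this scene
    task = random.choice(safe_tasks)
    policy_name = random.choice(list(COLLECT_POLICIES))
    return COLLECT_POLICIES[policy_name](task)
```

In a real deployment each stub would be a call to the respective foundation model, and the chosen episode would be logged for training data.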
Despite its promise, the current approach has limitations: it relies on scripted and learned policies, faces communication-bandwidth constraints, and must contend with sparse data. AutoRT also requires some degree of human supervision, since unsafe tasks can occasionally pass affordance filtering. Evaluations of task generation, affordance filtering, and model training show that AutoRT generates diverse and safe tasks and improves the performance of state-of-the-art robot learning models.
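The constitution-based critiquing of proposed tasks mentioned above could be sketched as below. The rule texts paraphrase the categories named in the text (foundational rules, safety constraints, embodiment); the keyword check is a stand-in for an actual LLM self-critique call, and all identifiers are hypothetical:

```python
# Hypothetical sketch of critiquing a proposed task against a robot
# constitution before execution. A real system would prompt an LLM with
# the constitution and the task and parse its verdict; the keyword check
# below is only a placeholder for that call.
CONSTITUTION = {
    "foundational": "The robot may not harm or endanger humans.",
    "safety": "Avoid sharp objects, electrical hazards, and living beings.",
    "embodiment": "The robot has one arm and cannot lift heavy objects.",
}

UNSAFE_KEYWORDS = ("human", "knife", "outlet", "heavy")

def critique_task(task: str) -> tuple[bool, str]:
    """Return (accepted, reason) for a proposed task."""
    for word in UNSAFE_KEYWORDS:
        if word in task.lower():
            return False, f"rejected: task mentions '{word}'"
    return True, "accepted"
```

For example, `critique_task("pick up the knife")` would be rejected, while `critique_task("wipe the table with a sponge")` would pass; tasks that slip through such filtering are what still necessitates human supervision.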