2 Jul 2024 | Michael Ahn, Debidatta Dwibedi, Chelsea Finn, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Karol Hausman, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Sean Kirmani, Isabel Leal, Edward Lee, Sergey Levine, Yao Lu, Sharath Maddineni, Kanishka Rao, Dorsa Sadigh, Pannag Sanketi, Pierre Sermanet, Quan Vuong, Stefan Welker, Fei Xia, Ted Xiao, Peng Xu, Steve Xu, Zhuo Xu
AutoRT is a system that leverages existing foundation models to scale the deployment of operational robots into completely unseen scenarios with minimal human supervision. It uses vision-language models (VLMs) for scene understanding and grounding, and large language models (LLMs) to propose diverse, novel instructions for a fleet of robots, enabling effective reasoning about autonomy tradeoffs and safety while significantly scaling up data collection for robot learning.

Evaluated over 7 months, across 4 different buildings, with a fleet of more than 20 robots, AutoRT collected 77,000 real-world robot episodes through both teleoperation and autonomous execution, demonstrating that it can generate diverse, real-world robot data on new skills in new environments. It is the first system in which LLM-controlled robots are allowed to drive autonomously in real-world settings, propose their own goals, and take actions toward those goals. A single human can supervise 3-5 mobile manipulators, and the collected data is highly diverse, can be steered toward task-appropriate collection, and can be used to improve state-of-the-art robot learning models. AutoRT also introduces a method for aligning robot behavior with human preferences through prompting and critiquing with a robot constitution.

AutoRT is designed to handle any state observed by a robot and to generate tasks executable by one of k different collect policies. The robot constitution defines foundational rules, safety constraints, and the robot's embodiment, and ablations of the system design demonstrate its usefulness. AutoRT is a step toward scaling robot data collection to the breadth of foundation models and embodying foundation models in robotic systems.
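The collection loop described above (VLM scene grounding, LLM task proposal, constitution-based filtering, and routing to one of k collect policies) might be sketched roughly as follows. All names here (`describe_scene`, `propose_tasks`, `passes_constitution`, `COLLECT_POLICIES`) are illustrative placeholders, not the paper's actual API, and the VLM/LLM calls are replaced by stubs:

```python
# Hypothetical sketch of an AutoRT-style collection step.
# VLM and LLM calls are replaced with placeholder stubs.
import random

# k different collect policies a generated task can be routed to
# (e.g., a scripted policy, a learned policy, or human teleoperation).
COLLECT_POLICIES = {
    "scripted": lambda task: f"scripted execution of: {task}",
    "learned": lambda task: f"learned-policy execution of: {task}",
    "teleop": lambda task: f"human teleoperation of: {task}",
}

def describe_scene(image):
    """Stand-in for a VLM call that grounds the scene into object names."""
    return ["apple", "sponge", "cup"]  # placeholder output

def propose_tasks(objects, n=5):
    """Stand-in for an LLM call that proposes diverse manipulation tasks."""
    return [f"pick up the {obj}" for obj in objects][:n]

def passes_constitution(task):
    """Stand-in for LLM self-critique against a robot constitution:
    foundational rules, safety constraints, and embodiment limits."""
    forbidden = ("human", "knife", "outlet")
    return not any(word in task for word in forbidden)

def collect_step(image):
    """One iteration: perceive, propose, filter, then route to a policy."""
    objects = describe_scene(image)
    tasks = propose_tasks(objects)
    safe_tasks = [t for t in tasks if passes_constitution(t)]
    if not safe_tasks:
        return None  # nothing safe to attempt in this scene
    task = random.choice(safe_tasks)
    policy_name = random.choice(list(COLLECT_POLICIES))
    return COLLECT_POLICIES[policy_name](task)
```

In a real deployment each stub would be a call to the respective foundation model, and the chosen episode would be logged for training data.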
Despite its promise, the current approach has limitations: it relies on scripted and learned policies, faces communication-bandwidth constraints, and must contend with sparse data. AutoRT also requires some degree of human supervision, since unsafe tasks can occasionally pass affordance filtering. Evaluations of task generation, affordance filtering, and model training show that AutoRT generates diverse and safe tasks and improves the performance of state-of-the-art robot learning models.
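The constitution-based critiquing of proposed tasks mentioned above could be sketched as below. The rule texts paraphrase the categories named in the text (foundational rules, safety constraints, embodiment); the keyword check is a stand-in for an actual LLM self-critique call, and all identifiers are hypothetical:

```python
# Hypothetical sketch of critiquing a proposed task against a robot
# constitution before execution. A real system would prompt an LLM with
# the constitution and the task and parse its verdict; the keyword check
# below is only a placeholder for that call.
CONSTITUTION = {
    "foundational": "The robot may not harm or endanger humans.",
    "safety": "Avoid sharp objects, electrical hazards, and living beings.",
    "embodiment": "The robot has one arm and cannot lift heavy objects.",
}

UNSAFE_KEYWORDS = ("human", "knife", "outlet", "heavy")

def critique_task(task: str) -> tuple[bool, str]:
    """Return (accepted, reason) for a proposed task."""
    for word in UNSAFE_KEYWORDS:
        if word in task.lower():
            return False, f"rejected: task mentions '{word}'"
    return True, "accepted"
```

For example, `critique_task("pick up the knife")` would be rejected, while `critique_task("wipe the table with a sponge")` would pass; tasks that slip through such filtering are what still necessitates human supervision.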