Robotic Control via Embodied Chain-of-Thought Reasoning

12 Jul 2024 | Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, Sergey Levine
This paper introduces Embodied Chain-of-Thought Reasoning (ECoT) for vision-language-action (VLA) policies, which enables robot policies to improve performance by reasoning about a task before acting. ECoT trains VLAs to perform multiple steps of reasoning about plans, sub-tasks, motions, and visually grounded features such as object bounding boxes and end-effector positions before predicting the robot action. The authors design a scalable pipeline for generating synthetic ECoT training data on large robot datasets. They demonstrate that ECoT increases the absolute success rate of OpenVLA, the current strongest open-source VLA policy, by 28% across challenging generalization tasks without any additional robot training data. Additionally, ECoT makes it easier for humans to interpret a policy's failures and correct its behavior interactively using natural language. Finally, the model learns to transfer its ECoT reasoning to unseen embodiments and tasks.
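To make the idea concrete, the reasoning chain can be viewed as a structured text prefix that the policy generates autoregressively before it emits action tokens. The sketch below is a minimal, hypothetical rendering of that idea in Python; the field names, ordering, and example values are illustrative assumptions, not the exact prompt schema used in the paper.

```python
# Minimal sketch of an embodied chain-of-thought prefix (illustrative only).
# Field names and ordering are assumptions for exposition, not the paper's exact schema.
from dataclasses import dataclass


@dataclass
class EcotStep:
    task: str          # the language instruction, possibly rephrased
    plan: str          # high-level plan for the full task
    subtask: str       # sub-task currently being executed
    move: str          # low-level motion description, e.g. "move gripper left"
    gripper_px: tuple  # end-effector position in image pixel coordinates
    objects: dict      # object name -> bounding box (x1, y1, x2, y2)

    def to_prompt(self) -> str:
        """Serialize the reasoning chain into the text generated before action tokens."""
        obj_str = "; ".join(f"{name} {box}" for name, box in self.objects.items())
        return (
            f"TASK: {self.task}\n"
            f"PLAN: {self.plan}\n"
            f"SUBTASK: {self.subtask}\n"
            f"MOVE: {self.move}\n"
            f"GRIPPER POSITION: {self.gripper_px}\n"
            f"VISIBLE OBJECTS: {obj_str}\n"
            f"ACTION:"  # the policy's action tokens follow this marker
        )


# Toy example of one reasoning step preceding an action prediction.
step = EcotStep(
    task="put the carrot in the pot",
    plan="1. locate carrot 2. grasp carrot 3. move over pot 4. release",
    subtask="grasp the carrot",
    move="move gripper down and close",
    gripper_px=(212, 147),
    objects={"carrot": (180, 120, 240, 170), "pot": (60, 90, 150, 200)},
)
print(step.to_prompt())
```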
The paper discusses the limitations of learned robot control policies, which often struggle to generalize outside their training data. Recent work on vision-language-action models (VLAs) has shown that using large, internet pre-trained vision-language models as the backbone of learned robot policies can substantially improve their robustness and generalization. However, one of the most exciting capabilities of large vision-language models in other domains is their ability to reason iteratively through complex problems. The authors hypothesize that VLA performance can be similarly boosted by training the models to textually reason about their plan, environment, and motions, allowing them to produce more accurate and robust robot actions. Concretely, they train VLAs to perform multiple steps of reasoning about plans, sub-tasks, motions, and visually grounded features such as object bounding boxes and end-effector positions before predicting the robot action, and they build a scalable pipeline for generating synthetic ECoT training data on large robot datasets.
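The data-generation pipeline is only described at a high level here; the sketch below shows one plausible way such annotations could be assembled from off-the-shelf models. Every helper is a hypothetical stand-in (returning canned values so the example runs), not the authors' actual tooling.

```python
# Hypothetical sketch of an ECoT annotation pipeline for an existing robot dataset.
# All helpers below are stand-ins for off-the-shelf models (an open-vocabulary
# detector, an instruction-following LLM, simple kinematics); they return canned
# values here and are assumptions for illustration, not the paper's implementation.

def detect_objects(image):
    # Stand-in for an open-vocabulary object detector: {name: (x1, y1, x2, y2)}.
    return {"carrot": (180, 120, 240, 170), "pot": (60, 90, 150, 200)}

def plan_with_llm(instruction):
    # Stand-in for an LLM that decomposes the instruction into ordered sub-tasks.
    return ["locate the carrot", "grasp the carrot", "move over the pot", "release"]

def describe_motion(action):
    # Stand-in for mapping a low-level action vector to a short motion phrase.
    return "move gripper down and close"

def project_gripper(robot_state):
    # Stand-in for projecting the end-effector pose into image pixel coordinates.
    return (212, 147)

def annotate_trajectory(trajectory, instruction):
    """Attach a reasoning chain to every timestep of a demonstration."""
    subtasks = plan_with_llm(instruction)
    annotated = []
    for t, (image, robot_state, action) in enumerate(trajectory):
        idx = min(t * len(subtasks) // len(trajectory), len(subtasks) - 1)
        annotated.append({
            "task": instruction,
            "plan": subtasks,
            "subtask": subtasks[idx],         # which sub-task this timestep serves
            "move": describe_motion(action),  # low-level motion description
            "gripper_px": project_gripper(robot_state),
            "objects": detect_objects(image),
            "action": action,                 # the original action label is kept
        })
    return annotated

# Toy usage: a 4-step "trajectory" of (image, robot_state, action) placeholders.
toy_traj = [(None, None, [0.0] * 7) for _ in range(4)]
labels = annotate_trajectory(toy_traj, "put the carrot in the pot")
print(labels[0]["subtask"], "->", labels[-1]["subtask"])
```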