12 Jul 2024 | Michal Zawalski*,1,2, William Chen*,1, Karl Pertsch1,3, Oier Mees1, Chelsea Finn*, Sergey Levine1
The paper introduces Embodied Chain-of-Thought Reasoning (ECoT) for Vision-Language-Action (VLA) models, which are used to control robots. ECoT trains a VLA model to reason in multiple steps about plans, sub-tasks, motions, and visually grounded features before it predicts a robot action. This addresses a limitation of standard VLA models, which map observations directly to actions and cannot reason iteratively through complex problems. The authors design a scalable pipeline that generates synthetic ECoT training data for large robot datasets. Experiments show that ECoT increases the success rate of OpenVLA, a state-of-the-art VLA policy, by 28% across challenging generalization tasks without additional robot training data. ECoT also makes policy failures more interpretable and lets humans correct policy behavior with natural language feedback. The model further demonstrates that ECoT reasoning transfers to unseen embodiments and tasks. The paper also discusses related work, the design of the ECoT reasoning steps, and efficient inference strategies for ECoT policies.
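As an illustration of the "reason first, act last" ordering described above, the following is a minimal sketch of how one training target might be serialized. It is not the authors' format: the class `ECoTStep`, the function `format_target`, the field labels (`PLAN`, `SUBTASK`, `MOTION`, `VISUAL`, `ACTION`), and the bounding-box representation are all assumptions made for illustration. Only the four kinds of reasoning (plans, sub-tasks, motions, visually grounded features) come from the summary.

```python
from dataclasses import dataclass


@dataclass
class ECoTStep:
    """One hypothetical training example: reasoning fields plus the final action."""
    plan: str
    subtask: str
    motion: str
    # Assumed representation: object name -> (x_min, y_min, x_max, y_max) in pixels.
    visual_features: dict[str, tuple[int, int, int, int]]
    # Assumed representation: a flat list of continuous action values.
    action: list[float]


def format_target(step: ECoTStep) -> str:
    """Serialize the reasoning before the action.

    Because an autoregressive model emits tokens left to right, placing the
    action last means every reasoning field is generated before the action.
    """
    features = "; ".join(
        f"{name} {box}" for name, box in step.visual_features.items()
    )
    action_text = " ".join(f"{value:.2f}" for value in step.action)
    lines = [
        f"PLAN: {step.plan}",
        f"SUBTASK: {step.subtask}",
        f"MOTION: {step.motion}",
        f"VISUAL: {features}",
        f"ACTION: {action_text}",
    ]
    return "\n".join(lines)


example = ECoTStep(
    plan="pick up the cup, then place it on the plate",
    subtask="pick up the cup",
    motion="move arm left",
    visual_features={"cup": (40, 60, 90, 120)},
    action=[0.1, -0.25, 0.0],
)
target = format_target(example)
```

Under this assumed layout, correcting the policy with natural language feedback would amount to overwriting one of the reasoning lines before the model continues generating the rest of the target.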