1 Jun 2024 | Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Sermanet, Quan Vuong, Jonathan Tompson, Yevgen Chebotar, Debidatta Dwibedi, Dorsa Sadigh
RT-H: Action Hierarchies Using Language
This paper introduces RT-H, a method that uses language to build action hierarchies for robot policies. The goal is to improve performance and enable human intervention in tasks. RT-H predicts language motions, such as "move arm forward" or "rotate arm right", which are then used to predict robot actions. This hierarchy allows the model to learn shared structures across tasks with different descriptions and enables humans to correct language motions to prevent task failure.
RT-H uses a Vision-Language Model (VLM) to predict language motions from the task description and the current visual observation, and then conditions on those language motions to predict robot actions. Because the language-motion layer captures structure shared across tasks, the model can tap into large multi-task datasets more effectively and learn policies that are more robust and flexible. RT-H also lets humans give language motion corrections to the robot, which are then used to improve its action predictions.
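To make the two-level inference concrete, here is a minimal sketch of one control step, assuming a hypothetical `vlm` wrapper with `predict_language_motion` and `predict_action` methods; these names and the `Observation` type are illustrative, not the paper's actual API.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class Observation:
    image: np.ndarray  # current camera frame
    task: str          # high-level task, e.g. "close the pistachio jar"


def rt_h_step(vlm, obs: Observation, human_correction: Optional[str] = None):
    """One control step: task + image -> language motion -> robot action."""
    # Stage 1: the VLM maps the task description and image to a language
    # motion such as "move arm forward" or "rotate arm right".
    language_motion = vlm.predict_language_motion(obs.image, obs.task)

    # Humans intervene at this level: a correction simply replaces the
    # predicted language motion before the action is computed.
    if human_correction is not None:
        language_motion = human_correction

    # Stage 2: the VLM, now also conditioned on the language motion,
    # predicts the low-level robot action.
    action = vlm.predict_action(obs.image, obs.task, language_motion)
    return language_motion, action
```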
The paper shows that RT-H outperforms prior methods on a broad range of tasks, including "close the pistachio jar". It also demonstrates that RT-H can learn from human corrections and outperform methods that learn from teleoperated interventions. Both levels of the hierarchy are implemented with a single VLM that is co-trained on internet-scale data to improve policy learning.
The paper also discusses related work in language-conditioned policies, hierarchical action representations, and interactive imitation learning. It shows that RT-H improves performance by leveraging language motions as an intermediate layer between high-level tasks and low-level actions. This allows the model to learn the shared structure of low-level motions across seemingly disparate tasks.
The paper evaluates RT-H on a diverse multi-task dataset and shows that it outperforms other methods in terms of performance, contextuality, and robustness to out-of-distribution settings. It also demonstrates that RT-H can learn from human corrections and that language motions are a more sample-efficient space to learn corrections than teleoperated actions.
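As a rough illustration of how corrections in language space could be collected and reused, here is a hedged sketch; the `env`, `get_human_correction`, and `dataset` interfaces are assumptions made for this example, not the paper's pipeline.

```python
def collect_language_corrections(vlm, env, get_human_correction, dataset, num_steps=100):
    """Roll out the policy; record human relabels of the language motion."""
    obs = env.reset()
    for _ in range(num_steps):
        predicted_motion = vlm.predict_language_motion(obs.image, obs.task)

        # The human inspects the predicted language motion and may override
        # it; returning None means the prediction is accepted as-is.
        correction = get_human_correction(obs, predicted_motion)
        motion = correction if correction is not None else predicted_motion

        if correction is not None:
            # Corrections are stored in language space, which the paper
            # argues is more sample-efficient than teleoperated actions.
            dataset.append((obs.image, obs.task, correction))

        action = vlm.predict_action(obs.image, obs.task, motion)
        obs = env.step(action)
    return dataset
```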
Overall, RT-H provides a new paradigm for flexible policies that can learn from human intervention expressed in language. The model uses language motions to build action hierarchies, enabling better data sharing across diverse multi-task datasets and allowing humans to correct language motions to prevent task failure. The results show that RT-H learns policies that are more robust and flexible, and that it can incorporate human corrections to further improve performance.