1 Jun 2024 | Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Sermanet, Quan Vuong, Jonathan Tompson, Yevgen Chebotar, Debidatta Dwibedi, Dorsa Sadigh
The paper introduces RT-H (Robot Transformer with Action Hierarchies), a method that leverages language to improve the performance and flexibility of robot policies. RT-H builds an action hierarchy using language motions: fine-grained language descriptions of low-level robot actions, such as "move arm forward" or "close gripper". By predicting these language motions as an intermediate step, RT-H can learn the structure shared across tasks whose high-level descriptions differ semantically, enabling better data sharing and generalization in multi-task datasets. Additionally, humans can provide language-motion corrections during execution, which the robot can both act on and later learn from. Experimental results show that RT-H outperforms existing methods by 15% on a wide range of tasks and demonstrates greater flexibility and context awareness in out-of-distribution settings. The paper also discusses the benefits of language motions for interactive imitation learning, showing that training on language-motion corrections is more sample-efficient than training on teleoperated corrections.
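To make the described hierarchy concrete, below is a minimal sketch, assuming a two-stage policy that first predicts a language motion from the image and task instruction, then predicts a low-level action conditioned on that motion, with a human correction able to override the intermediate prediction. All names here (`HierarchicalPolicy`, `predict_language_motion`, `predict_action`, `step`) are illustrative assumptions, not the paper's actual implementation or API, and the prediction bodies are stubs.

```python
# Sketch of an RT-H-style action hierarchy (illustrative only, not the paper's code).
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Observation:
    image: Any   # current camera frame
    task: str    # high-level instruction, e.g. "pick up the coke can"


class HierarchicalPolicy:
    """Two-stage policy: (image, task) -> language motion -> low-level action."""

    def predict_language_motion(self, obs: Observation) -> str:
        # Stage 1: predict a fine-grained motion phrase; stubbed for illustration.
        return "move arm forward"

    def predict_action(self, obs: Observation, motion: str) -> dict:
        # Stage 2: predict the low-level action (e.g. end-effector delta, gripper
        # command) conditioned on the language motion; stubbed for illustration.
        return {"delta_pose": [0.05, 0.0, 0.0], "gripper": "open", "motion": motion}

    def step(self, obs: Observation, correction: Optional[str] = None) -> dict:
        # A human-supplied language-motion correction overrides the model's own
        # intermediate prediction; such corrections can later be folded back into
        # the training data for interactive imitation learning.
        motion = correction if correction is not None else self.predict_language_motion(obs)
        return self.predict_action(obs, motion)


if __name__ == "__main__":
    policy = HierarchicalPolicy()
    obs = Observation(image=None, task="pick up the coke can")
    print(policy.step(obs))                                   # autonomous step
    print(policy.step(obs, correction="rotate arm right"))    # human correction
```

The key design point this sketch tries to capture is that the intermediate language motion is both a prediction target (so tasks with different descriptions can share motion-level data) and a natural interface for human intervention.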