Learning Manipulation by Predicting Interaction

1 Jun 2024 | Jia Zeng, Qingwen Bu, Bangjun Wang, Wenke Xia, Li Chen, Hao Dong, Haoming Song, Dong Wang, Di Hu, Ping Luo, Heming Cui, Bin Zhao, Xuelong Li, Yu Qiao, Hongyang Li
This paper proposes MPI (Manipulation by Predicting Interaction), a pre-training framework for robotic manipulation. MPI learns visual representations by modeling both how to interact with objects and where to interact. Given keyframes capturing the initial and final states of an interaction, together with a language instruction, the framework predicts the intermediate transition frame and detects the interaction object.
Two objectives are formulated: predicting the transition frame, which captures "how-to-interact", and detecting the interaction object, which captures "where-to-interact". They are implemented with Prediction Transformer and Detection Transformer modules, respectively. The model is pre-trained on a large-scale egocentric human video dataset and evaluated on downstream tasks including visuomotor control on a real-world robot, Franka Kitchen, Meta-World, and referring expression grounding. Across these benchmarks, MPI outperforms previous state-of-the-art methods, improving success rates by up to 64% on the real-world robot platform and 10% in Franka Kitchen. The paper also discusses the limitations of the current framework, including the reliance on explicit interaction annotations and the need for further research on long-horizon planning and causal reasoning.
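The dual-objective design described above can be sketched roughly as follows. This is a minimal illustrative PyTorch sketch, not the authors' implementation: the class name, token shapes, module sizes, and the single-query box decoder are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class MPISketch(nn.Module):
    """Illustrative two-branch pre-training head (hypothetical, not the paper's code).

    Consumes token features of the initial keyframe, the final keyframe,
    and the language instruction; produces (a) predicted transition-frame
    tokens ("how-to-interact") and (b) a normalized bounding box for the
    interaction object ("where-to-interact").
    """

    def __init__(self, dim: int = 256, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        # "Prediction Transformer" stand-in: regresses transition-frame tokens
        # from the fused keyframe + language tokens.
        self.predictor = nn.TransformerEncoder(enc_layer, n_layers)
        # "Detection Transformer" stand-in: a learned query attends to the
        # fused tokens and is decoded into a (cx, cy, w, h) box in [0, 1].
        self.det_query = nn.Parameter(torch.randn(1, 1, dim))
        dec_layer = nn.TransformerDecoderLayer(dim, n_heads, batch_first=True)
        self.detector = nn.TransformerDecoder(dec_layer, n_layers)
        self.box_head = nn.Linear(dim, 4)

    def forward(self, init_tokens, final_tokens, lang_tokens):
        # Fuse the two keyframes and the instruction along the token axis.
        fused = torch.cat([init_tokens, final_tokens, lang_tokens], dim=1)
        transition = self.predictor(fused)                    # how-to-interact
        query = self.det_query.expand(fused.size(0), -1, -1)
        box = self.box_head(self.detector(query, fused))      # where-to-interact
        return transition, box.sigmoid().squeeze(1)
```

In pre-training, the transition branch would be supervised against the held-out intermediate frame and the box branch against the annotated interaction object; both losses are backpropagated into the shared visual encoder, which is the representation transferred to downstream control tasks.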