1 Jun 2024 | Jia Zeng, Qingwen Bu, Bangjun Wang, Wenke Xia, Li Chen, Hao Dong, Haoming Song, Dong Wang, Di Hu, Ping Luo, Heming Cui, Bin Zhao, Xuelong Li, Yu Qiao, Hongyang Li
The paper introduces MPI (Manipulation by Predicting Interaction), a pre-training framework designed to improve representation learning for robotic manipulation. Unlike existing methods that rely on contrastive learning, masked signal modeling, or video prediction from random frames, MPI takes keyframes as input and learns to predict the transition frame and to detect the manipulated object, which fosters a better understanding of "how-to-interact" and "where-to-interact." The framework has two main components: a Prediction Transformer, which predicts the transition frame, and a Detection Transformer, which detects the interaction object.
The authors evaluate MPI on real-world robot experiments, Franka Kitchen simulations, Meta-World, and a robotics-related recognition task (referring expression grounding). MPI improves over state-of-the-art methods by 10% to 64%, depending on the setting. The paper also includes ablation studies and discusses the limitations and future directions of the approach.
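
To make the two pre-training signals concrete, here is a minimal sketch of how a transition-frame prediction objective and an interaction-object detection objective could be combined into one training loss. Everything in it is an assumption for illustration: the summary does not state the loss functions, the weights, or the data layout, so `frame_mse`, `box_l1`, `joint_loss`, `w_pred`, and `w_det` are hypothetical names, and mean squared error and L1 box error are stand-ins rather than the paper's actual objectives.

```python
from typing import List, Sequence

Frame = List[List[float]]  # toy grayscale image as rows of pixel values
Box = Sequence[float]      # (x_min, y_min, x_max, y_max), normalized to [0, 1]


def frame_mse(pred: Frame, target: Frame) -> float:
    """Mean squared pixel error between predicted and true transition frames."""
    diffs = [
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    ]
    return sum(diffs) / len(diffs)


def box_l1(pred: Box, target: Box) -> float:
    """Mean absolute coordinate error between predicted and true object boxes."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def joint_loss(
    pred_frame: Frame,
    true_frame: Frame,
    pred_box: Box,
    true_box: Box,
    w_pred: float = 1.0,  # hypothetical weight on the "how-to-interact" term
    w_det: float = 1.0,   # hypothetical weight on the "where-to-interact" term
) -> float:
    """Weighted sum of the prediction term and the detection term."""
    return w_pred * frame_mse(pred_frame, true_frame) + w_det * box_l1(pred_box, true_box)


if __name__ == "__main__":
    predicted_frame = [[0.0, 1.0], [1.0, 0.0]]
    actual_frame = [[0.0, 0.0], [1.0, 1.0]]
    predicted_box = (0.1, 0.1, 0.5, 0.5)
    actual_box = (0.2, 0.1, 0.5, 0.7)
    print(joint_loss(predicted_frame, actual_frame, predicted_box, actual_box))
```

In a sketch like this, the weights let either signal be switched off (for example `w_det=0.0`), which is one simple way an ablation over the two objectives could be set up.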