1 Feb 2024 | Junjie Wen, Yichen Zhu, Minjie Zhu, Jiming Li, Zhiyuan Xu, Zhengping Che, Chaomin Shen, Yaxin Peng, Dong Liu, Feifei Feng, and Jian Tang
This paper introduces the Object-Centric Instruction Augmentation (OCI) framework, which enhances robotic manipulation by incorporating object positions into language instructions. A Multimodal Large Language Model (MLLM) is fine-tuned to recognize object locations and encode them into natural language, augmenting each instruction with both absolute and relative position cues so that the policy network receives explicit spatial grounding rather than plain text. The authors also introduce a feature reuse mechanism that lets the policy network leverage the pre-trained MLLM's features, improving performance without incurring high computational cost.

Evaluated on both simulated and real-world manipulation tasks, OCI outperforms policies trained on traditional language instructions, with the positional cues significantly raising success rates. The results further highlight the value of pre-trained models along the "where" dimension: clear positional information expressed in natural language leads to more accurate and efficient manipulation, with better task execution and generalization across scenarios.
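To make the augmentation idea concrete, here is a minimal sketch of what injecting absolute and relative position cues into an instruction might look like. The function name, the normalized bounding-box format, and the cue templates are all illustrative assumptions, not the paper's actual prompt design.

```python
# Hypothetical sketch: augment a language instruction with absolute and
# relative position cues, assuming a detector has already produced
# normalized (x0, y0, x1, y1) bounding boxes for named objects.

def augment_instruction(instruction, objects, target):
    """Append absolute and relative position cues for the target object."""
    x0, y0, x1, y1 = objects[target]
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2  # absolute cue: box center

    cues = [f"The {target} is at ({cx:.2f}, {cy:.2f}) in the image."]

    # Relative cue: name the nearest other object and a coarse direction.
    others = {name: box for name, box in objects.items() if name != target}
    if others:
        def center(box):
            return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)
        nearest = min(
            others,
            key=lambda n: (center(others[n])[0] - cx) ** 2
                        + (center(others[n])[1] - cy) ** 2,
        )
        ox, _ = center(others[nearest])
        direction = "left of" if cx < ox else "right of"
        cues.append(f"It is to the {direction} the {nearest}.")

    return instruction + " " + " ".join(cues)


print(augment_instruction(
    "Pick up the red block.",
    {"red block": (0.1, 0.4, 0.2, 0.5), "blue bowl": (0.6, 0.4, 0.8, 0.6)},
    target="red block",
))
# -> "Pick up the red block. The red block is at (0.15, 0.45) in the
#     image. It is to the left of the blue bowl."
```

In the paper the MLLM itself generates these cues after fine-tuning; the sketch above only shows the shape of the augmented instruction the policy would consume.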
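The feature reuse mechanism can likewise be sketched in a few lines of PyTorch: intermediate features from the frozen, pre-trained MLLM are projected through a small adapter and fused with the policy's own observation encoding. Module names, dimensions, and the fusion-by-concatenation choice here are assumptions for illustration, not the paper's exact architecture.

```python
# Hypothetical sketch of feature reuse: a frozen MLLM's cached features are
# adapted and concatenated with the policy's observation encoding, so the
# policy benefits from pre-trained representations at low added cost.

import torch
import torch.nn as nn

class FeatureReusePolicy(nn.Module):
    def __init__(self, obs_dim=512, mllm_dim=4096, act_dim=7):
        super().__init__()
        # A small linear adapter keeps the policy cheap: the large MLLM
        # stays frozen, and only its output feature is projected down.
        self.adapter = nn.Linear(mllm_dim, obs_dim)
        self.policy_head = nn.Sequential(
            nn.Linear(obs_dim * 2, 256),
            nn.ReLU(),
            nn.Linear(256, act_dim),  # e.g. end-effector delta + gripper
        )

    def forward(self, obs_feat, mllm_feat):
        fused = torch.cat([obs_feat, self.adapter(mllm_feat)], dim=-1)
        return self.policy_head(fused)

policy = FeatureReusePolicy()
obs_feat = torch.randn(1, 512)        # policy's own visual/state encoding
mllm_feat = torch.randn(1, 4096)      # cached feature from the frozen MLLM
action = policy(obs_feat, mllm_feat)  # -> shape (1, 7)
```

The design point is that the expensive model runs once to produce reusable features, while only the lightweight adapter and policy head are trained.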