LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

28 Jun 2024 | Xiang Li, Cristina Mata, Jongwoo Park, Kumara Kahatapitiya, Yoo Sung Jang, Jinghuan Shang, Kanchana Ranasinghe, Ryan Burget, Mu Cai, Yong Jae Lee, Michael S. Ryoo
LLaRA (Large Language and Robotics Assistant) is a framework that transforms robot action policies into conversations and enhances policy learning with auxiliary data. The framework aims to improve the performance of Vision Language Models (VLMs) in robot manipulation tasks by formulating these tasks as instruction-response pairs. LLaRA generates diverse, high-quality robotics instruction data from existing behavior cloning data and fine-tunes a VLM using this data. The framework also introduces auxiliary datasets that complement policy learning in a self-supervised manner. Experiments in simulated and real-world environments demonstrate the effectiveness of LLaRA, showing state-of-the-art performance. Key contributions include formulating robot manipulation tasks as instruction-response pairs, a scalable pipeline for generating diverse instruction tuning data, and identifying and generating auxiliary instruction data.

The code, datasets, and pre-trained models are available at <https://github.com/LostXine/LLaRA>.
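To make the core idea concrete, the conversion of behavior cloning data into instruction-response pairs might look roughly like the following sketch. The field names, prompt wording, and coordinate-as-text action encoding here are illustrative assumptions, not the paper's exact data format.

```python
# Hypothetical sketch of turning one behavior-cloning step into an
# instruction-response pair for VLM fine-tuning, in the spirit of LLaRA.
# Field names and the text encoding of actions are assumptions.

def format_action(x: float, y: float) -> str:
    """Encode a 2D end-effector target as normalized text coordinates."""
    return f"({x:.3f}, {y:.3f})"

def to_instruction_pair(task: str, step: dict) -> dict:
    """Convert one logged demonstration step into a VLM training sample."""
    instruction = (
        f"<image>\nThe task is: {task}. "
        "What is the next action the robot should take?"
    )
    response = f"Move the gripper to {format_action(step['x'], step['y'])}."
    return {
        "image": step["image"],       # path to the observation frame
        "instruction": instruction,   # prompt shown to the VLM
        "response": response,         # target text the VLM learns to emit
    }

sample = to_instruction_pair(
    "pick up the red block",
    {"image": "frame_0041.png", "x": 0.412, "y": 0.733},
)
print(sample["response"])  # Move the gripper to (0.412, 0.733).
```

Framing actions as text in this way lets an off-the-shelf VLM be fine-tuned on robot data with the same conversational interface it was pre-trained on.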