LLARVA is a vision-action instruction-tuned model for robotic learning that uses structured prompts and 2D visual traces to align the vision and action spaces. It is pre-trained on 8.5M image-visual-trace pairs from the Open X-Embodiment dataset and evaluated on 12 tasks in the RLBench simulator as well as on a real 7-DoF Franka Emika Panda robot. By operating on 2D images, which are easier to scale and to integrate with existing large multimodal models (LMMs), and by predicting intermediate 2D representations (visual traces), LLARVA improves the alignment between visual observations and robot actions.

Training follows an instruction-tuning recipe combining pre-training and fine-tuning: each language instruction encodes the robot type, control mode, task, and proprioceptive information, and the model is asked to predict both future actions and the corresponding visual trace. This structured formulation lets the model generalize across different robotic environments and configurations. The predicted 2D visual trace also acts as a memory buffer in long-horizon tasks, compensating for the limited number of previous robotic states available in the prompt.

In experiments, LLARVA outperforms existing 2D and 3D models, achieving high success rates across multiple tasks, and it transfers well to the real robot, adapting to different control modes and environments. Overall, LLARVA's approach addresses the challenge of aligning vision and action modalities in robotics and demonstrates the potential of instruction-tuned LMMs for robotic applications.
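
To make the structured-prompt idea concrete, the sketch below shows one way such an instruction could be assembled from the robot type, control mode, task, and proprioceptive history, and how a predicted visual trace plus action might be parsed from the model's text output. This is a minimal illustration under assumed formats: the field wording, the `TRACE: ... | ACTION: ...` response layout, and all function and class names are hypothetical, not LLARVA's actual interface.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical prompt builder illustrating the kind of structured instruction
# described above. Field names and template wording are assumptions.

@dataclass
class RobotState:
    # End-effector position plus gripper opening: a simplified stand-in
    # for the proprioceptive information mentioned in the text.
    position: Tuple[float, float, float]
    gripper_open: float


def build_instruction(robot_type: str,
                      control_mode: str,
                      task: str,
                      history: List[RobotState]) -> str:
    """Assemble a structured language instruction from its components."""
    state_str = "; ".join(
        f"pos=({s.position[0]:.3f}, {s.position[1]:.3f}, {s.position[2]:.3f}), "
        f"gripper={s.gripper_open:.2f}"
        for s in history
    )
    return (
        f"Robot: {robot_type}. Control mode: {control_mode}. "
        f"Task: {task}. Previous states: {state_str}. "
        f"Predict the next action and the 2D visual trace of the end effector."
    )


def parse_prediction(text: str) -> Tuple[List[Tuple[int, ...]], List[float]]:
    """Parse a response of the assumed form
    'TRACE: (u1,v1) (u2,v2) ... | ACTION: a1 a2 ...'."""
    trace_part, action_part = text.split("|")
    trace = [
        tuple(int(v) for v in pt.strip("() ").split(","))
        for pt in trace_part.replace("TRACE:", "").split()
        if pt.strip("() ")
    ]
    action = [float(a) for a in action_part.replace("ACTION:", "").split()]
    return trace, action


if __name__ == "__main__":
    prompt = build_instruction(
        robot_type="Franka Emika Panda",
        control_mode="end-effector position",
        task="pick up the red block",
        history=[RobotState((0.40, 0.05, 0.30), 1.0)],
    )
    print(prompt)

    # A mock model response in the assumed output format.
    reply = "TRACE: (112,86) (118,92) (125,97) | ACTION: 0.41 0.06 0.25 1.0"
    trace, action = parse_prediction(reply)
    print(trace, action)
```

The point of the sketch is only the structure: one flat text prompt carrying embodiment, control mode, task, and state history, and one text response carrying both the intermediate 2D trace and the action, so that a single instruction-tuned LMM can be supervised on both outputs.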