LLARVA is a vision-action instruction-tuned model for robotic learning that uses structured prompts and 2D visual traces to align the vision and action spaces. It is pre-trained on 8.5M image-visual-trace pairs from the Open X-Embodiment dataset and evaluated on 12 tasks in the RLBench simulator as well as on a real 7-DoF Franka Emika Panda robot. By operating on 2D images, which are easier to scale and to integrate with existing large multimodal models (LMMs), and by predicting intermediate 2D representations (visual traces), LLARVA improves the alignment between visual observations and robot actions.

Training follows an instruction-tuning recipe combining pre-training and fine-tuning: each language instruction encodes the robot type, control mode, task, and proprioceptive information, and the model is asked to predict both future actions and the corresponding visual trace. This structured formulation lets the model generalize across different robotic environments and configurations. The predicted 2D visual trace also acts as a memory buffer in long-horizon tasks, compensating for the limited number of previous robotic states available in the prompt.

In experiments, LLARVA outperforms existing 2D and 3D models, achieving high success rates across multiple tasks, and it transfers well to the real robot, adapting to different control modes and environments. Overall, LLARVA's approach addresses the challenge of aligning vision and action modalities in robotics and demonstrates the potential of instruction-tuned LMMs for robotic applications.
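
To make the structured-prompt idea concrete, the sketch below shows one way such an instruction could be assembled from the robot type, control mode, task, and proprioceptive history, and how a predicted visual trace plus action might be parsed from the model's text output. This is a minimal illustration under assumed formats: the field wording, the `TRACE: ... | ACTION: ...` response layout, and all function and class names are hypothetical, not LLARVA's actual interface.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical prompt builder illustrating the kind of structured instruction
# described above. Field names and template wording are assumptions.

@dataclass
class RobotState:
    # End-effector position plus gripper opening: a simplified stand-in
    # for the proprioceptive information mentioned in the text.
    position: Tuple[float, float, float]
    gripper_open: float


def build_instruction(robot_type: str,
                      control_mode: str,
                      task: str,
                      history: List[RobotState]) -> str:
    """Assemble a structured language instruction from its components."""
    state_str = "; ".join(
        f"pos=({s.position[0]:.3f}, {s.position[1]:.3f}, {s.position[2]:.3f}), "
        f"gripper={s.gripper_open:.2f}"
        for s in history
    )
    return (
        f"Robot: {robot_type}. Control mode: {control_mode}. "
        f"Task: {task}. Previous states: {state_str}. "
        f"Predict the next action and the 2D visual trace of the end effector."
    )


def parse_prediction(text: str) -> Tuple[List[Tuple[int, ...]], List[float]]:
    """Parse a response of the assumed form
    'TRACE: (u1,v1) (u2,v2) ... | ACTION: a1 a2 ...'."""
    trace_part, action_part = text.split("|")
    trace = [
        tuple(int(v) for v in pt.strip("() ").split(","))
        for pt in trace_part.replace("TRACE:", "").split()
        if pt.strip("() ")
    ]
    action = [float(a) for a in action_part.replace("ACTION:", "").split()]
    return trace, action


if __name__ == "__main__":
    prompt = build_instruction(
        robot_type="Franka Emika Panda",
        control_mode="end-effector position",
        task="pick up the red block",
        history=[RobotState((0.40, 0.05, 0.30), 1.0)],
    )
    print(prompt)

    # A mock model response in the assumed output format.
    reply = "TRACE: (112,86) (118,92) (125,97) | ACTION: 0.41 0.06 0.25 1.0"
    trace, action = parse_prediction(reply)
    print(trace, action)
```

The point of the sketch is only the structure: one flat text prompt carrying embodiment, control mode, task, and state history, and one text response carrying both the intermediate 2D trace and the action, so that a single instruction-tuned LMM can be supervised on both outputs.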