LLARVA is a novel instruction-tuned Large Multimodal Model (LMM) designed for robotic applications. It uses structured prompts to unify a range of robotic learning tasks, scenarios, and environments, and incorporates 2-D visual traces to align the vision and action spaces, improving the model's ability to predict robot actions. LLARVA is trained on 8.5M image-visual trace pairs from the Open X-Embodiment dataset and evaluated on 12 tasks in the RLBench simulator as well as on a physical Franka Emika Panda 7-DoF robot. The experiments demonstrate strong performance: LLARVA outperforms several contemporary baselines and generalizes well across different robot environments and configurations. Ablation studies and real-world robot evaluations further validate the model's effectiveness, highlighting its adaptability and efficiency in various robotic applications.
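To make the structured-prompt idea concrete, below is a minimal, hypothetical sketch of how an instruction prompt and its text target (next action plus 2-D visual trace) might be paired for instruction tuning. The field names, prompt wording, and serialization format are illustrative assumptions, not the paper's exact template.

```python
# Hypothetical sketch: pairing a structured prompt with an action + 2-D trace target.
# Prompt fields and formatting are assumptions for illustration only.

def build_prompt(robot: str, control_mode: str, task: str, num_pred_steps: int) -> str:
    """Compose a structured prompt describing the robot, control mode, and task."""
    return (
        f"You are a {robot} robot using {control_mode} control. "
        f"Task: {task}. "
        f"Predict the next {num_pred_steps} end-effector action(s) and the 2-D visual trace."
    )

def build_target(actions, trace_xy) -> str:
    """Serialize ground-truth actions and 2-D trace points (image coordinates) as text."""
    action_str = " ".join(f"[{' '.join(f'{a:.3f}' for a in step)}]" for step in actions)
    trace_str = " ".join(f"({x},{y})" for x, y in trace_xy)
    return f"Actions: {action_str} Trace: {trace_str}"

# Example: one image observation would be paired with this prompt/target text.
prompt = build_prompt("Franka Emika Panda 7-DoF", "delta end-effector", "pick up the red block", 1)
target = build_target(
    actions=[[0.010, -0.020, 0.030, 0.000, 0.000, 0.000, 1.000]],
    trace_xy=[(112, 96), (118, 90), (125, 84)],
)
print(prompt)
print(target)
```

In this sketch the trace points stand in for the projected 2-D path of the end-effector in the image plane, which is the signal the abstract describes as aligning the vision and action spaces.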