RoboCodeX is a multimodal code generation framework for robotic behavior synthesis that translates high-level human instructions into precise robotic actions. The framework uses a tree-of-thought structure to decompose instructions into object-centric manipulation units that incorporate physical preferences and safety constraints. It leverages a specialized multimodal reasoning dataset and an iterative self-updating methodology for supervised fine-tuning, strengthening its ability to map conceptual and perceptual understanding into control commands. A vision adapter supports multi-scale visual feature integration and bridges cognitive perception with robotic planning. Extensive experiments show that RoboCodeX achieves state-of-the-art performance in both simulators and on real robots across four manipulation tasks and one navigation task, and it generalizes well across different robotic platforms and tasks; ablation studies confirm the importance of preference prediction and the vision adapter. By translating visual observations and human instructions into precise, robot-specific actions, the model demonstrates robustness and adaptability in both simulated and real-world environments. The integration of multimodal tree-of-thought reasoning, specialized datasets, and iterative fine-tuning substantially enhances its capacity for complex robotic manipulation, marking a significant advance in embodied AI and enabling robots to adapt to and manipulate their environments with a high degree of sophistication.
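
To make the decomposition idea concrete, below is a minimal sketch of how a high-level instruction might be broken into a tree of object-centric manipulation units carrying physical preferences and safety constraints. This is not the paper's actual output format or API; the class names, field encodings, and the example instruction are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ManipulationUnit:
    """One object-centric step of the decomposed instruction (hypothetical schema)."""
    target_object: str                  # object the step acts on
    action: str                         # e.g. "grasp", "place", "open"
    grasp_preference: str = "top-down"  # physical preference (illustrative encoding)
    safety_constraints: List[str] = field(default_factory=list)

@dataclass
class TaskNode:
    """Node in the tree-of-thought decomposition of the instruction."""
    description: str
    units: List[ManipulationUnit] = field(default_factory=list)
    children: List["TaskNode"] = field(default_factory=list)

def decompose_instruction(instruction: str) -> TaskNode:
    """Hand-written stand-in for the model's decomposition of one example
    instruction; the real framework would generate this structure."""
    root = TaskNode(description=instruction)
    pick = TaskNode(
        description="pick up the mug",
        units=[ManipulationUnit(
            target_object="mug",
            action="grasp",
            grasp_preference="side grasp on handle",
            safety_constraints=["limit gripper force", "avoid collision with table"],
        )],
    )
    place = TaskNode(
        description="place the mug on the shelf",
        units=[ManipulationUnit(
            target_object="mug",
            action="place",
            grasp_preference="keep upright",
            safety_constraints=["release only when supported"],
        )],
    )
    root.children = [pick, place]
    return root

def execute(node: TaskNode, depth: int = 0) -> None:
    """Walk the tree depth-first and print the control command each unit maps to
    (printed here in place of real robot API calls)."""
    print("  " * depth + f"- {node.description}")
    for unit in node.units:
        print("  " * (depth + 1) +
              f"{unit.action}({unit.target_object!r}, "
              f"preference={unit.grasp_preference!r}, "
              f"constraints={unit.safety_constraints})")
    for child in node.children:
        execute(child, depth + 1)

if __name__ == "__main__":
    tree = decompose_instruction("put the mug on the shelf")
    execute(tree)
```

Traversing the tree depth-first yields an ordered sequence of object-centric commands, which is the point at which a framework like RoboCodeX would bind each unit to robot-specific control code.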