RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis


25 Feb 2024 | Yao Mu, Junting Chen, Qinglong Zhang, Shoufa Chen, Qiaojun Yu, Chongjian Ge, Runjian Chen, Zhixuan Liang, Mengkang Hu, Chaofan Tao, Peize Sun, Haibao Yu, Chao Yang, Wenqi Shao, Wenhai Wang, Jifeng Dai, Yu Qiao, Mingyu Ding, Ping Luo
**Abstract:** Robotic behavior synthesis, the process of understanding multimodal inputs and generating precise physical control for robots, is a critical aspect of Embodied AI. Despite advancements in using large language models (LLMs) for high-level understanding, translating these concepts into detailed robotic actions remains challenging, especially across different scenarios. This paper introduces RoboCodeX, a tree-structured multimodal code generation framework designed for generalized robotic behavior synthesis. RoboCodeX decomposes high-level human instructions into multiple object-centric manipulation units, incorporating physical preferences such as affordances and safety constraints, and applies code generation to enhance generalization across different robotic platforms. To improve the mapping from conceptual and perceptual understanding to control commands, a specialized multimodal reasoning dataset is collected for pre-training, and an iterative self-updating methodology is introduced for supervised fine-tuning. Extensive experiments demonstrate that RoboCodeX achieves state-of-the-art performance on both simulated and real robots across four manipulation tasks and one navigation task.

**Introduction:** Embodied AI aims to equip intelligent agents with perception, reasoning, and interaction capabilities in the physical world. A central challenge is the generalizability of robotic manipulation frameworks: previous methods leverage LLMs to propose step-by-step natural-language plans, but these plans lack grounding in the surrounding environment. This paper proposes RoboCodeX, a large vision-language model with tree-of-thought reasoning capabilities for robotic code generation. It translates high-level semantic understanding into tailored robotic behaviors by decomposing instructions into object-centric manipulation units and predicting physical constraints, preferential rankings, and target position proposals. The model is pre-trained on a specialized multimodal reasoning dataset and fine-tuned with an iterative self-updating methodology. Extensive experiments show that RoboCodeX outperforms state-of-the-art models in both simulated and real-world environments.

**Methods:** RoboCodeX employs a multimodal tree-of-thought architecture that decomposes instructions into code units and expands them through multimodal predictions. It uses a specialized dataset and an iterative fine-tuning methodology to enhance its capacity for translating semantics and physical preferences into robot-specific motions. The model integrates a vision adapter to facilitate multi-scale visual feature integration and bridge cognitive perception with robotic planning.
**Evaluation:** RoboCodeX is evaluated on a range of robotic manipulation tasks and embodied navigation tasks, demonstrating superior performance compared to baselines. Ablation studies confirm the critical role of preference prediction and the vision adapter.
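The vision adapter credited in the Methods and ablation discussion is only described at a high level in this summary. The PyTorch sketch below shows one plausible way to fuse multi-scale backbone features into a visual token sequence for a language model; the module name, channel sizes, and fusion strategy are assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class MultiScaleVisionAdapter(nn.Module):
    """Projects feature maps from several backbone stages into a shared
    embedding space and concatenates them into a visual token sequence."""

    def __init__(self, in_channels=(256, 512, 1024), embed_dim=768):
        super().__init__()
        self.projections = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c, embed_dim, kernel_size=1), nn.GELU())
            for c in in_channels
        ])
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, features):
        # features: list of (B, C_i, H_i, W_i) maps from different scales
        tokens = []
        for proj, feat in zip(self.projections, features):
            x = proj(feat)                               # (B, D, H_i, W_i)
            tokens.append(x.flatten(2).transpose(1, 2))  # (B, H_i*W_i, D)
        return self.norm(torch.cat(tokens, dim=1))       # (B, sum(H_i*W_i), D)


if __name__ == "__main__":
    # Fake multi-scale features at three resolutions
    feats = [torch.randn(1, 256, 32, 32),
             torch.randn(1, 512, 16, 16),
             torch.randn(1, 1024, 8, 8)]
    print(MultiScaleVisionAdapter()(feats).shape)  # torch.Size([1, 1344, 768])
```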