OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement

28 Feb 2024 | Tianyu Zheng1*, Ge Zhang1,2*, Tianhao Shen1*, Xueling Liu1, Bill Yuchen Lin3, Jie Fu1,4, Wenhu Chen1,2, Xiang Yue1,5†
The paper introduces *OpenCodeInterpreter*, an open-source code system designed to generate, execute, and iteratively refine code. *OpenCodeInterpreter* is trained on the Code-Feedback dataset, which comprises 68K multi-turn interactions between users, code models, and compilers. The dataset integrates execution feedback and human feedback to enable dynamic code refinement. The system is evaluated on benchmarks such as HumanEval, MBPP, and their enhanced versions from EvalPlus. Notably, *OpenCodeInterpreter*-33B achieves an accuracy of 83.2 on the average of HumanEval and MBPP (76.4 on their plus versions), closely matching GPT-4's 84.2 (76.2). With synthesized human feedback from GPT-4, *OpenCodeInterpreter* further improves to 91.6 (84.6). The paper also details the construction of the Code-Feedback dataset, the experimental setup, and case studies demonstrating *OpenCodeInterpreter*'s practical applications. The results highlight *OpenCodeInterpreter*'s strong performance in code generation and refinement, narrowing the gap between open-source models and proprietary systems like the GPT-4 Code Interpreter.
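
The refinement loop the paper describes (generate code, execute it, and feed interpreter errors back to the model as a new conversational turn) can be summarized with a minimal sketch. This is an illustrative outline, not the paper's implementation: `generate_code` is a hypothetical placeholder for a call to the OpenCodeInterpreter model, and the turn budget `MAX_ROUNDS` is an assumed parameter.

```python
import subprocess
import sys
import tempfile

MAX_ROUNDS = 3  # assumption: a small fixed budget of refinement turns


def generate_code(task: str, history: list[str]) -> str:
    """Hypothetical stand-in for a call to the OpenCodeInterpreter model.

    In the real system this would query the fine-tuned LLM with the full
    multi-turn context: the task, prior code, and execution feedback.
    """
    raise NotImplementedError("plug in a model call here")


def execute(code: str) -> tuple[bool, str]:
    """Run the candidate code in a subprocess and capture any error output."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, timeout=10
        )
    except subprocess.TimeoutExpired:
        return False, "Execution timed out"
    return result.returncode == 0, result.stderr


def refine_loop(task: str) -> str:
    """Generate code, execute it, and feed interpreter errors back to the
    model until the run succeeds or the turn budget is exhausted."""
    history: list[str] = [task]
    code = generate_code(task, history)
    for _ in range(MAX_ROUNDS):
        ok, feedback = execute(code)
        if ok:
            return code
        # Execution feedback becomes the next user turn, mirroring the
        # multi-turn structure of the Code-Feedback dataset.
        history.append(f"Execution failed:\n{feedback}\nPlease fix the code.")
        code = generate_code(task, history)
    return code
```

In the full system, the same loop also accommodates human feedback turns, which the Code-Feedback data simulates with GPT-4-synthesized critiques, the source of the 91.6 (84.6) result reported above.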