SEMCODER: Training Code Language Models with Comprehensive Semantics

3 Jun 2024 | Yangruibo Ding, Jinjun Peng, Marcus J. Min, Gail Kaiser, Junfeng Yang, Baishakhi Ray
SEMCODER is a code language model trained with comprehensive semantics to enhance understanding and reasoning about program execution. The model is trained on a synthetic dataset, PYX, which consists of executable code samples paired with functional descriptions and execution traces. The dataset is curated to capture multiple facets of program semantics: high-level functional descriptions, the local execution effects of individual statements, and overall input/output behavior.

The model is trained with a novel strategy called Monologue Reasoning, in which code execution is described step by step in natural language, mimicking how a human verbally debugs a program. This approach allows the model to understand and reason about code execution more effectively, improving performance on both code generation and execution reasoning tasks. SEMCODER achieves 81.1% on HumanEval and 54.5% on CRUXEval-I, outperforming GPT-3.5-turbo. The model also demonstrates strong debugging and self-refinement capabilities, leveraging execution traces and debugging rationales to iteratively improve generated code. The research highlights the potential of integrating deep semantic understanding into code language models to enhance their effectiveness on complex programming tasks. The dataset and model are publicly released for further research and development.
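To make the idea of per-statement execution traces concrete, here is a rough sketch of how such training data could be collected. This is an illustrative approximation, not the paper's actual PYX tooling: it uses Python's `sys.settrace` to record, for each executed line of a function, the offset of that line and a snapshot of the local variables, the kind of local execution effects the summary describes. The helper name `trace_locals` and the trace format are assumptions for this example.

```python
import sys

def trace_locals(func, *args):
    """Run `func`, recording each executed line and its local variables.

    Returns (result, events), where each event is a pair of
    (line offset within the function, snapshot of local variables).
    A toy stand-in for per-statement execution traces, hypothetical format.
    """
    events = []

    def tracer(frame, event, arg):
        # Only record 'line' events inside the traced function itself.
        if event == "line" and frame.f_code is func.__code__:
            offset = frame.f_lineno - func.__code__.co_firstlineno
            events.append((offset, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)  # always restore the default (no) tracer
    return result, events

def running_max(xs):
    best = xs[0]
    for x in xs[1:]:
        if x > best:
            best = x
    return best

result, events = trace_locals(running_max, [3, 7, 2])
print(result)  # 7
for offset, local_vars in events:
    print(offset, local_vars)
```

Each recorded event shows how the local state (here, `best` and `x`) evolves statement by statement; pairing such traces with the source code and a natural-language narration of each step is the flavor of supervision the Monologue Reasoning strategy is built around.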