SEMCODER: Training Code Language Models with Comprehensive Semantics

3 Jun 2024 | Yangruibo Ding, Jinjun Peng, Marcus J. Min, Gail Kaiser, Junfeng Yang, Baishakhi Ray
SEMCODER is a code language model trained with comprehensive semantics to enhance understanding and reasoning about program execution. The model is trained on a synthetic dataset, PYX, which consists of executable code samples paired with functional descriptions and execution traces. The dataset is curated to capture multiple facets of program semantics: high-level functional descriptions, the local execution effects of individual statements, and overall input/output behavior.

The model is trained with a novel strategy called Monologue Reasoning, in which code execution is described step by step in natural language, mimicking how a human verbally debugs a program. This approach allows the model to understand and reason about code execution more effectively, improving performance on both code generation and execution reasoning tasks. SEMCODER achieves 81.1% on HumanEval and 54.5% on CRUXEval-I, outperforming GPT-3.5-turbo. The model also demonstrates strong debugging and self-refinement capabilities, leveraging execution traces and debugging rationales to iteratively improve generated code. The research highlights the potential of integrating deep semantic understanding into code language models to enhance their effectiveness on complex programming tasks. The dataset and model are publicly released for further research and development.
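To make the idea of per-statement execution traces concrete, here is a rough sketch of how such training data could be collected. This is an illustrative approximation, not the paper's actual PYX tooling: it uses Python's `sys.settrace` to record, for each executed line of a function, the offset of that line and a snapshot of the local variables, the kind of local execution effects the summary describes. The helper name `trace_locals` and the trace format are assumptions for this example.

```python
import sys

def trace_locals(func, *args):
    """Run `func`, recording each executed line and its local variables.

    Returns (result, events), where each event is a pair of
    (line offset within the function, snapshot of local variables).
    A toy stand-in for per-statement execution traces, hypothetical format.
    """
    events = []

    def tracer(frame, event, arg):
        # Only record 'line' events inside the traced function itself.
        if event == "line" and frame.f_code is func.__code__:
            offset = frame.f_lineno - func.__code__.co_firstlineno
            events.append((offset, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)  # always restore the default (no) tracer
    return result, events

def running_max(xs):
    best = xs[0]
    for x in xs[1:]:
        if x > best:
            best = x
    return best

result, events = trace_locals(running_max, [3, 7, 2])
print(result)  # 7
for offset, local_vars in events:
    print(offset, local_vars)
```

Each recorded event shows how the local state (here, `best` and `x`) evolves statement by statement; pairing such traces with the source code and a natural-language narration of each step is the flavor of supervision the Monologue Reasoning strategy is built around.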