The paper "Reasoning Runtime Behavior of a Program with LLM: How Far Are We?" by Junkai Chen addresses the limitations of current benchmarks for evaluating the code reasoning capabilities of large language models (LLMs). Traditional benchmarks like HumanEval and ClassEval focus on predicting the input and output of programs but neglect intermediate behavior and logical consistency during program execution. To address these gaps, the authors propose a new framework called REval, which evaluates LLMs' ability to reason about runtime behavior and incremental consistency.
REval consists of two main components:
1. **Runtime Behavior Reasoning**: This component includes four tasks—Code Coverage Prediction (CCP), Program State Prediction (PSP), Execution Path Prediction (EPP), and Output Prediction (OP)—that assess the model's ability to predict intermediate states and control flow (a small sketch after this list illustrates the ground truth each task queries).
2. **Incremental Consistency Evaluation**: This component introduces a novel metric, Incremental Consistency (IC), to measure the model's ability to maintain logical consistency across sequentially related tasks.
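To make the four tasks concrete, here is a minimal Python sketch (not from the paper; the toy function `sample` and the tracing code are illustrative assumptions) showing how the ground truth for each task could be collected from a single execution with `sys.settrace`: which lines run (CCP), the local variables at each step (PSP), the order in which lines execute (EPP), and the final output (OP).

```python
import sys

def sample(x):
    # Toy program under test (illustrative; not taken from the paper's benchmark).
    if x % 2 == 0:
        y = x // 2
    else:
        y = 3 * x + 1
    return y

trace = []  # (line about to execute, locals visible at that point)

def tracer(frame, event, arg):
    if event == "line" and frame.f_code.co_name == "sample":
        trace.append((frame.f_lineno, dict(frame.f_locals)))
    return tracer

sys.settrace(tracer)
output = sample(7)
sys.settrace(None)

executed = [lineno for lineno, _ in trace]
print("CCP - lines covered:   ", sorted(set(executed)))
print("EPP - execution order: ", executed)
print("PSP - program states:  ", trace)
print("OP  - program output:  ", output)
```

In an evaluation setting, each of these recorded facts would become a question posed to the model, with the trace serving as the reference answer.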
The authors conduct a large-scale empirical study using existing code benchmarks (HumanEval and ClassEval) and adapt them to fit the REval framework. The results show that most LLMs perform poorly on both Runtime Behavior Reasoning (average accuracy of 44.4%) and Incremental Consistency Evaluation (average IC score of 10.3). The study highlights the urgent need for the community to enhance the code reasoning capabilities of LLMs, particularly in understanding and reasoning about runtime behavior and maintaining logical consistency.
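The paper's exact IC formula is not reproduced here; as a rough illustration, the sketch below assumes an IC-style score that counts an example as consistent only when the model's correct answers form a prefix of the task chain CCP → PSP → EPP → OP (it never solves a later task while missing an earlier one). The correctness tuples in `results` are made-up data, not the paper's measurements.

```python
# Hypothetical per-example correctness along the chain CCP -> PSP -> EPP -> OP.
results = [
    (True,  True,  True,  True),   # all four tasks answered correctly
    (True,  True,  False, False),  # failures only after successes: consistent
    (True,  False, True,  False),  # later task right, earlier wrong: inconsistent
    (False, False, False, False),  # no correct answers: vacuously consistent here
]

def is_prefix_consistent(chain):
    """True if no correct answer appears after an incorrect one."""
    seen_wrong = False
    for correct in chain:
        if not correct:
            seen_wrong = True
        elif seen_wrong:
            return False
    return True

ic_score = sum(is_prefix_consistent(c) for c in results) / len(results)
print(f"IC-style score: {ic_score:.1%}")  # 75.0% on the toy data above
```

A stricter variant might, for example, exclude fully incorrect chains; the paper itself should be consulted for the authoritative definition.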
The paper also discusses the background and related work, including the importance of code execution behavior and logical consistency in LLMs. It provides a detailed overview of the REval framework, including the construction of adapted benchmarks and the experimental setup. The results and discussions section analyzes the performance of different LLMs on the proposed tasks and identifies areas for improvement. Overall, the study emphasizes the need for more comprehensive and realistic benchmarks to better evaluate and improve the code reasoning abilities of LLMs.