This paper proposes REval, a comprehensive framework for evaluating the code reasoning ability and consistency of code LLMs. The framework comprises two evaluation components: Runtime Behavior Reasoning and Incremental Consistency Evaluation. The first component evaluates the ability of code LLMs to predict intermediate states of program execution, including code coverage, program state, execution path, and output. The second component measures the consistency of code LLMs across sequentially related tasks. The framework is built upon existing code benchmarks, which are adapted to evaluate code LLMs against the runtime behavior of program execution. A large-scale empirical study shows that most LLMs perform poorly on both Runtime Behavior Reasoning (average accuracy of 44.4%) and Incremental Consistency Evaluation (average IC score of 10.3). These results highlight the importance of evaluating code reasoning ability with respect to runtime behavior and incremental consistency, and the paper calls for targeted efforts to remedy these weaknesses in code LLMs. The framework, code, data, and the REval leaderboard are available at https://r-eval.github.io.