CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution

5 Jan 2024 | Alex Gu*, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, Sida I. Wang
CRUXEval is a new benchmark designed to evaluate the input and output prediction abilities of code language models (LMs). The benchmark consists of 800 Python functions, each paired with an input-output example, leading to two natural tasks: input prediction and output prediction. The authors propose a generic recipe for generating the benchmark, which can be used to create future variations. They evaluate 20 code models and find that many recent models that score highly on HumanEval do not show the same improvements on CRUXEval. Simple CoT (chain-of-thought) and fine-tuning schemes can improve performance but remain far from solving the benchmark. GPT-4 with CoT achieves a pass@1 of 75% and 81% on input and output prediction, respectively, while Code Llama 34B achieves 50% and 46%. The benchmark highlights the gap between open-source and closed-source models and provides insights into the limitations of current LMs in code reasoning and execution. The authors also provide examples of GPT-4's consistent failures on simple programs to illustrate its code reasoning capabilities and areas for improvement.
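To make the two tasks concrete, below is a small illustrative sketch in the style of a CRUXEval item. The function f, the input "benchmark", and the expected output are invented for illustration and are not drawn from the actual benchmark.

# Hypothetical item in the style of CRUXEval (invented for illustration).
# Each benchmark item pairs a short Python function with a concrete input
# and the output produced by running the function on that input.
def f(s):
    # Drop vowels, then reverse the remaining characters.
    consonants = [c for c in s if c.lower() not in "aeiou"]
    return "".join(reversed(consonants))

# Output prediction: given f and the input "benchmark", predict the output.
# Input prediction: given f and the output "krmhcnb", supply an input that
# produces it (inputs need not be unique).
assert f("benchmark") == "krmhcnb"

A prediction is scored by executing an assertion of this form, so for input prediction any input that yields the given output counts as correct.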