5 Jan 2024 | Alex Gu*, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, Sida I. Wang
CRUXEval is a benchmark for evaluating the code reasoning, understanding, and execution capabilities of language models. It consists of 800 Python functions, each paired with an input and its corresponding output, and defines two tasks: input prediction (CRUXEval-I) and output prediction (CRUXEval-O). The benchmark was constructed in three steps: generating candidate functions and inputs, filtering them for simplicity and solvability, and selecting a final set of 800 samples. Together, the two tasks measure whether a model can reason about code behavior well enough to predict the result of executing it.
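To make the task format concrete, here is a hypothetical item in the style of the benchmark (invented for illustration, not taken from CRUXEval itself): a short Python function together with an assertion, where CRUXEval-I asks the model to fill in the input and CRUXEval-O asks it to fill in the output.

```python
# Hypothetical example in the style of a CRUXEval item (not from the benchmark).
# Each item is a short Python function paired with an input and its output.

def f(text):
    # Collect the characters that appear more than once, preserving order.
    result = []
    for ch in text:
        if text.count(ch) > 1 and ch not in result:
            result.append(ch)
    return ''.join(result)

# CRUXEval-I (input prediction): given f and the output, fill in the input.
#   assert f(??) == "ab"
# CRUXEval-O (output prediction): given f and the input, fill in the output.
#   assert f("abcab") == ??
assert f("abcab") == "ab"
```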
Twenty code models were evaluated on the benchmark, including GPT-4 and open-source models such as Code Llama 34B. The best configuration, GPT-4 with chain-of-thought prompting, achieved a pass@1 of 75% on input prediction and 81% on output prediction, while Code Llama 34B reached 50% and 46%, respectively. The results show a significant gap between open-source and closed-source models, and even the strongest models fall well short of solving the benchmark, indicating that code reasoning and execution are not yet fully mastered.
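The pass@1 scores above follow the standard unbiased pass@k estimator from functional-correctness benchmarks (Chen et al., 2021): draw n samples per problem, count the c correct ones, and average 1 - C(n-c, k)/C(n, k) over problems. A minimal sketch, not the benchmark's own evaluation code:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem:
    n = samples drawn, c = correct samples, k <= n."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Benchmark-level score: average the per-problem estimates.
# e.g. three problems, 10 samples each, with 7, 0, and 3 correct samples:
scores = [pass_at_k(n=10, c=c, k=1) for c in (7, 0, 3)]
print(sum(scores) / len(scores))  # (0.7 + 0.0 + 0.3) / 3 ≈ 0.333
```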
The benchmark highlights the importance of evaluating code understanding and execution, not just code generation. Simple techniques such as chain-of-thought (CoT) prompting and fine-tuning improve performance but are not enough to solve the benchmark, and even an advanced model like GPT-4 fails consistently on simple programs. These consistent failures point to concrete areas for improvement and make CRUXEval a useful tool for measuring and improving the code reasoning and execution capabilities of language models.
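As a sketch of how chain-of-thought prompting is typically applied to output prediction, the snippet below contrasts a direct prompt with a CoT-style prompt. The templates are illustrative assumptions, not the paper's exact prompts.

```python
# Illustrative direct vs. chain-of-thought prompts for output prediction
# (CRUXEval-O). The prompt wording here is an assumption for illustration.

CODE = '''
def f(nums):
    nums.sort()
    return nums[0] + nums[-1]
'''

direct_prompt = f"""{CODE}
Complete the assertion with the exact output value:
assert f([3, 1, 2]) == ??
"""

cot_prompt = f"""{CODE}
Reason step by step about what the code does when executed on the input,
then complete the assertion with the exact output value:
assert f([3, 1, 2]) == ??
"""

# A correct completion in either case is 4: the sorted list is [1, 2, 3],
# so nums[0] + nums[-1] == 1 + 3.
```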