EFFIBENCH: Benchmarking the Efficiency of Automatically Generated Code

4 Jul 2024 | Dong Huang, Yuhao Qing, Weiyi Shang, Heming Cui, Jie M. Zhang
**Abstract:** Code generation models have become integral to software development, but their efficiency, a critical aspect of green computing and sustainability, has often been overlooked. This paper introduces EFFIBENCH, a benchmark designed to assess the efficiency of code generated by 42 large language models (LLMs). EFFIBENCH includes 1,000 efficiency-critical coding problems from LeetCode, each paired with a human-written canonical solution that achieves the highest efficiency on the LeetCode leaderboard. The evaluation reveals that LLMs generally produce less efficient code than human-written solutions; for example, GPT-4's generated code has an average execution time 3.12 times that of the human-written solutions. The source code for EFFIBENCH is available on GitHub, and a leaderboard is provided on Hugging Face.

**Introduction:** Code generation models, such as GPT-4 and Copilot, are increasingly used to assist developers in various tasks. While these models have been extensively evaluated for correctness, their efficiency remains a significant gap in the literature. Efficient code is crucial for resource-constrained environments and contributes to green computing. EFFIBENCH addresses this gap by focusing on efficiency metrics such as execution time and memory usage.

**Task Description:** The benchmark's illustrative example asks for two sorted arrays to be merged into a single sorted array: the inputs are two sorted arrays, and the output is one sorted array containing all of their elements. An example is provided to illustrate the problem.
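As a concrete illustration of the task described above, here is a minimal Python sketch that merges two sorted arrays with a two-pointer scan. The function name `merge_sorted` and this particular approach are illustrative assumptions; EFFIBENCH's canonical solutions are the most efficient human-written submissions from the LeetCode leaderboard and may differ.

```python
# Illustrative sketch of the "merge two sorted arrays" task; not EFFIBENCH's
# canonical (leaderboard) solution.
from typing import List


def merge_sorted(nums1: List[int], nums2: List[int]) -> List[int]:
    """Merge two sorted lists into one sorted list in O(len(nums1) + len(nums2)) time."""
    merged = []
    i = j = 0
    while i < len(nums1) and j < len(nums2):
        if nums1[i] <= nums2[j]:
            merged.append(nums1[i])
            i += 1
        else:
            merged.append(nums2[j])
            j += 1
    # One of the lists may still have a tail of larger elements left over.
    merged.extend(nums1[i:])
    merged.extend(nums2[j:])
    return merged


if __name__ == "__main__":
    print(merge_sorted([1, 3, 5], [2, 4, 6]))  # -> [1, 2, 3, 4, 5, 6]
```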
**Benchmark Construction:**
- **Efficiency-critical Problem Collection:** 2,605 initial problems were collected from LeetCode, of which 1,146 efficiency-critical problems were selected.
- **Canonical Solution Construction:** An executable canonical solution was manually fixed and provided for each problem.
- **Test Case Generation:** A test case generator was developed to produce diverse test cases for each problem.
- **Efficiency Metrics:** The metrics are execution time, maximum memory usage, and total memory usage (a hedged measurement sketch is given at the end of this summary).

**Evaluation:**
- **End2End Results:** Open-source models such as StarCoder2-15B and closed-source models such as GPT-4 showed varying degrees of inefficiency compared with human-written solutions.
- **Results with Identical Coding Problems:** Analysis restricted to the problems correctly solved by all models showed consistent results.
- **Results for Different Algorithms:** Different LLMs performed differently across the various algorithm subsets.
- **Worst Case Analysis:** Inefficient code generated by GPT-3.5-turbo-0301 was manually analyzed, highlighting issues with dynamic programming and backtracking algorithms.

**Conclusion and Future Work:** EFFIBENCH aims to inspire researchers to focus on both correctness and efficiency in code generation. Future work includes expanding language coverage, enhancing dataset diversity, and standardizing testing environments.
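EFFIBENCH's public harness on GitHub defines how execution time, maximum memory usage, and total memory usage are actually computed. As a rough illustration only, the sketch below profiles a candidate Python function with `time.perf_counter` and `tracemalloc`, approximating total memory usage as the time integral of traced memory. The `workload` function and the thread-based sampling approach are assumptions made for illustration, not EFFIBENCH's implementation.

```python
# Hedged illustration only: EFFIBENCH's own harness defines the official
# measurement procedure. Here, execution time is wall-clock time, maximum
# memory is tracemalloc's peak, and total memory usage is approximated as the
# time integral of traced memory sampled from a helper thread.
import threading
import time
import tracemalloc


def profile_solution(solution, *args, interval: float = 0.001):
    """Return (execution_time_s, peak_memory_bytes, total_memory_byte_seconds)."""
    samples = []                      # (timestamp, currently traced bytes)
    stop = threading.Event()

    def sampler():
        while not stop.is_set():
            current, _peak = tracemalloc.get_traced_memory()
            samples.append((time.perf_counter(), current))
            time.sleep(interval)

    tracemalloc.start()
    thread = threading.Thread(target=sampler, daemon=True)
    start = time.perf_counter()
    thread.start()
    solution(*args)                   # run the candidate code under test
    execution_time = time.perf_counter() - start
    stop.set()
    thread.join()
    _current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    # Trapezoidal approximation of the memory-over-time integral (byte-seconds).
    total_memory = sum(
        (m0 + m1) / 2 * (t1 - t0)
        for (t0, m0), (t1, m1) in zip(samples, samples[1:])
    )
    return execution_time, peak, total_memory


def workload(n: int) -> int:
    """Hypothetical candidate solution used only to demonstrate the profiler."""
    squares = [i * i for i in range(n)]
    return sum(squares)


if __name__ == "__main__":
    et, peak, total = profile_solution(workload, 500_000)
    print(f"execution time: {et:.4f} s, peak memory: {peak} B, "
          f"total memory usage: {total:.1f} B*s")
```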