4 Jul 2024 | Dong Huang, Yuhao Qing, Weiyi Shang, Heming Cui, Jie M. Zhang
**EffiBench: Benchmarking the Efficiency of Automatically Generated Code**
**Authors:** Dong Huang, Yuhao Qing, Weiyi Shang, Heming Cui, Jie M. Zhang
**Abstract:**
Code generation models have become integral to software development, but the efficiency of the code they generate, a critical aspect of green computing and sustainability, has often been overlooked. This paper introduces EFFIBENCH, a benchmark for assessing the efficiency of code generated by 42 large language models (LLMs). EFFIBENCH includes 1,000 efficiency-critical coding problems from LeetCode, each paired with a human-written canonical solution that achieves the highest efficiency on the LeetCode leaderboard. The evaluation reveals that LLMs generally produce less efficient code than human-written solutions; for example, GPT-4's generated code takes on average 3.12 times as long to execute. The source code for EFFIBENCH is available on GitHub, and a leaderboard is provided on Hugging Face.
**Introduction:**
Code generation models, such as GPT-4 and Copilot, are increasingly used to assist developers in various tasks. While these models have been extensively evaluated for correctness, the efficiency of the code they generate remains a significant gap in the literature. Efficient code is crucial for resource-constrained environments and contributes to green computing. EFFIBENCH addresses this gap by focusing on efficiency metrics such as execution time and memory usage.
**Task Description:**
The benchmark is illustrated with a problem that merges two sorted arrays into a single sorted array: the inputs are two sorted arrays, and the output is one sorted array containing all of their elements. An example is provided in the paper to illustrate the problem; a minimal sketch of one possible solution follows.
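To make the task concrete, here is a minimal sketch of one way the merge problem could be solved (a standard two-pointer merge; the exact function signature expected by EFFIBENCH may differ):

```python
def merge_sorted(a: list[int], b: list[int]) -> list[int]:
    """Merge two sorted lists into one sorted list with a two-pointer scan, O(m + n)."""
    merged = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    # Append whichever tail remains once one list is exhausted.
    merged.extend(a[i:])
    merged.extend(b[j:])
    return merged


print(merge_sorted([1, 2, 4], [1, 3, 4]))  # [1, 1, 2, 3, 4, 4]
```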
**Benchmark Construction:**
- **Efficiency-critical Problem Collection:** 2,605 initial problems were collected from LeetCode, and 1,146 efficiency-critical problems were selected.
- **Canonical Solution Construction:** The most efficient human-written solutions from the LeetCode leaderboard were collected and manually fixed so that each problem has an executable canonical solution.
- **Test Case Generation:** A test case generator was developed to produce diverse test cases for each problem.
- **Efficiency Metrics:** Metrics include execution time, maximum memory usage, and total memory usage (see the measurement sketch after this list).
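The sketch below shows one way execution time and peak memory could be measured using only the Python standard library; this is an assumption for illustration, not EFFIBENCH's actual profiling harness, and the total-memory-usage metric is not reproduced here.

```python
import time
import tracemalloc


def profile_call(fn, *args):
    """Time one call and record its peak Python heap allocation (in bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    fn(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()  # returns (current, peak)
    tracemalloc.stop()
    return elapsed, peak


# Toy usage: profile a naive sort-based merge on one generated test case.
elapsed, peak = profile_call(lambda a, b: sorted(a + b),
                             list(range(10_000)), list(range(10_000)))
print(f"execution time: {elapsed:.4f}s, peak memory: {peak / 1024:.1f} KiB")
```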
**Evaluation:**
- **End2End Results:** Both open-source models such as StarCoder2-15B and closed-source models such as GPT-4 generate code that is less efficient than the human-written canonical solutions, to varying degrees (the ratio behind these comparisons is sketched after this list).
- **Results with Identical Coding Problems:** Restricting the analysis to problems solved correctly by all models yields results consistent with the end-to-end findings.
- **Results for Different Algorithms:** The relative efficiency of the LLMs varies across algorithm categories.
- **Worst Case Analysis:** A manual analysis of the least efficient code generated by GPT-3.5-turbo-0301 highlights inefficient handling of dynamic programming and backtracking algorithms.
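The 3.12x figure cited in the abstract is an average ratio of generated-code execution time to canonical-solution execution time. A minimal sketch of how such a ratio could be aggregated over a set of problems, assuming per-problem timings keyed by problem name (the function and data names are illustrative, not EFFIBENCH's API):

```python
def avg_execution_time_ratio(gen_times: dict[str, float],
                             canon_times: dict[str, float]) -> float:
    """Average ratio of generated-code execution time to canonical-solution
    execution time over the problems present in both mappings."""
    ratios = [gen_times[p] / canon_times[p] for p in gen_times if p in canon_times]
    return sum(ratios) / len(ratios)


# Illustrative numbers only (seconds per problem), not measured EffiBench data.
generated = {"merge-sorted-arrays": 1.5, "two-sum": 0.9}
canonical = {"merge-sorted-arrays": 0.5, "two-sum": 0.3}
print(avg_execution_time_ratio(generated, canonical))  # 3.0
```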
**Conclusion and Future Work:**
EFFIBENCH aims to inspire researchers to focus on both correctness and efficiency in code generation. Future work includes expanding language coverage, enhancing dataset diversity, and standardizing testing environments.