4 Jul 2024 | Dong Huang, Yuhao Qing, Weiyi Shang, Heming Cui, Jie M. Zhang
**EffiBench: Benchmarking the Efficiency of Automatically Generated Code**
**Authors:** Dong Huang, Yuhao Qing, Weiyi Shang, Heming Cui, Jie M. Zhang
**Abstract:**
Code generation models have become integral to software development, but the efficiency of the code they generate, a critical aspect of green computing and sustainability, has often been overlooked. This paper introduces EFFIBENCH, a benchmark for assessing the efficiency of code generated by 42 large language models (LLMs). EFFIBENCH includes 1,000 efficiency-critical coding problems from LeetCode, each paired with a human-written canonical solution that achieves the highest efficiency on the LeetCode leaderboard. The evaluation reveals that LLMs generally produce less efficient code than human-written solutions; for example, GPT-4's generated code takes on average 3.12 times as long to execute. The source code for EFFIBENCH is available on GitHub, and a leaderboard is provided on Hugging Face.
**Introduction:**
Code generation models, such as GPT-4 and Copilot, are increasingly used to assist developers in various tasks. While these models have been extensively evaluated for correctness, the efficiency of the code they generate remains a significant gap in the literature. Efficient code is crucial for resource-constrained environments and contributes to green computing. EFFIBENCH addresses this gap by focusing on efficiency metrics such as execution time and memory usage.
**Task Description:**
The benchmark is illustrated with a problem that merges two sorted arrays into a single sorted array: the inputs are two sorted arrays, and the output is one sorted array containing all of their elements. An example is provided in the paper to illustrate the problem; a minimal sketch of one possible solution follows.
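To make the task concrete, here is a minimal sketch of one way the merge problem could be solved (a standard two-pointer merge; the exact function signature expected by EFFIBENCH may differ):

```python
def merge_sorted(a: list[int], b: list[int]) -> list[int]:
    """Merge two sorted lists into one sorted list with a two-pointer scan, O(m + n)."""
    merged = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    # Append whichever tail remains once one list is exhausted.
    merged.extend(a[i:])
    merged.extend(b[j:])
    return merged


print(merge_sorted([1, 2, 4], [1, 3, 4]))  # [1, 1, 2, 3, 4, 4]
```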
**Benchmark Construction:**
- **Efficiency-critical Problem Collection:** 2,605 initial problems were collected from LeetCode, and 1,146 efficiency-critical problems were selected.
- **Canonical Solution Construction:** The most efficient human-written solutions from the LeetCode leaderboard were collected and manually fixed so that each problem has an executable canonical solution.
- **Test Case Generation:** A test case generator was developed to produce diverse test cases for each problem.
- **Efficiency Metrics:** Metrics include execution time, maximum memory usage, and total memory usage (see the measurement sketch after this list).
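The sketch below shows one way execution time and peak memory could be measured using only the Python standard library; this is an assumption for illustration, not EFFIBENCH's actual profiling harness, and the total-memory-usage metric is not reproduced here.

```python
import time
import tracemalloc


def profile_call(fn, *args):
    """Time one call and record its peak Python heap allocation (in bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    fn(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()  # returns (current, peak)
    tracemalloc.stop()
    return elapsed, peak


# Toy usage: profile a naive sort-based merge on one generated test case.
elapsed, peak = profile_call(lambda a, b: sorted(a + b),
                             list(range(10_000)), list(range(10_000)))
print(f"execution time: {elapsed:.4f}s, peak memory: {peak / 1024:.1f} KiB")
```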
**Evaluation:**
- **End2End Results:** Both open-source models such as StarCoder2-15B and closed-source models such as GPT-4 generate code that is less efficient than the human-written canonical solutions, to varying degrees (the ratio behind these comparisons is sketched after this list).
- **Results with Identical Coding Problems:** Restricting the analysis to problems solved correctly by all models yields results consistent with the end-to-end findings.
- **Results for Different Algorithms:** The relative efficiency of the LLMs varies across algorithm categories.
- **Worst Case Analysis:** A manual analysis of the least efficient code generated by GPT-3.5-turbo-0301 highlights inefficient handling of dynamic programming and backtracking algorithms.
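The 3.12x figure cited in the abstract is an average ratio of generated-code execution time to canonical-solution execution time. A minimal sketch of how such a ratio could be aggregated over a set of problems, assuming per-problem timings keyed by problem name (the function and data names are illustrative, not EFFIBENCH's API):

```python
def avg_execution_time_ratio(gen_times: dict[str, float],
                             canon_times: dict[str, float]) -> float:
    """Average ratio of generated-code execution time to canonical-solution
    execution time over the problems present in both mappings."""
    ratios = [gen_times[p] / canon_times[p] for p in gen_times if p in canon_times]
    return sum(ratios) / len(ratios)


# Illustrative numbers only (seconds per problem), not measured EffiBench data.
generated = {"merge-sorted-arrays": 1.5, "two-sum": 0.9}
canonical = {"merge-sorted-arrays": 0.5, "two-sum": 0.3}
print(avg_execution_time_ratio(generated, canonical))  # 3.0
```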
**Conclusion and Future Work:**
EFFIBENCH aims to inspire researchers to focus on both correctness and efficiency in code generation. Future work includes expanding language coverage, enhancing dataset diversity, and standardizing testing environments.