EFFIBENCH: Benchmarking the Efficiency of Automatically Generated Code


4 Jul 2024 | Dong Huang, Yuhao Qing, Weiyi Shang, Heming Cui, Jie M. Zhang
EFFIBENCH is a benchmark designed to evaluate the efficiency of code generated by large language models (LLMs). It comprises 1,000 efficiency-critical coding problems selected from LeetCode, each paired with an executable, human-written canonical solution that achieves the highest efficiency on the LeetCode solution leaderboard.

The benchmark assesses code generated by 42 LLMs (35 open-source and 7 closed-source) using efficiency metrics such as execution time, memory usage, and total memory usage. The results show that LLM-generated code is generally less efficient than the human-written canonical solutions. For example, GPT-4-generated code has an average execution time 3.12 times that of the canonical solutions, and in the most extreme cases its execution time and total memory usage reach 13.89 and 43.92 times those of the canonical solutions, respectively.

The study highlights the importance of efficiency in code generation, since efficiency directly affects execution speed and memory utilization, which matters especially in resource-constrained environments. The paper concludes that EFFIBENCH is the first benchmark specifically designed to assess the efficiency of LLM-generated code, and that even state-of-the-art LLMs such as GPT-4 exhibit significant inefficiencies compared with optimal human-written solutions. The paper also provides an efficiency testing framework that enables efficiency evaluation across various code generation benchmarks. The source code of EFFIBENCH is released on GitHub, and a leaderboard is provided on Hugging Face.
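The summary above describes comparing the execution time, memory usage, and total memory usage of LLM-generated code against canonical solutions. The sketch below illustrates how such a per-problem comparison could be wired up using only Python's standard time and tracemalloc modules; it is an assumption-laden illustration, not EffiBench's actual harness, and the "total memory" proxy (peak memory multiplied by execution time) is a simplification introduced here.

```python
import time
import tracemalloc
from typing import Callable, Dict

def profile_solution(solution: Callable, *args) -> Dict[str, float]:
    """Run one solution on one test input and record rough efficiency metrics:
    execution time (s), peak memory (MB), and a crude time-integrated memory
    proxy (peak MB * seconds)."""
    tracemalloc.start()
    start = time.perf_counter()
    solution(*args)
    elapsed = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    peak_mb = peak_bytes / (1024 * 1024)
    return {
        "execution_time_s": elapsed,
        "peak_memory_mb": peak_mb,
        "total_memory_mb_s": peak_mb * elapsed,  # illustrative proxy only
    }

def efficiency_ratios(llm: Dict[str, float], canonical: Dict[str, float]) -> Dict[str, float]:
    """Normalize the LLM-generated code's metrics by the canonical solution's,
    mirroring the ratio-style comparison in the summary (e.g. 3.12x execution time)."""
    return {k: llm[k] / canonical[k] for k in llm}

# Example: a naive O(n^2) two-sum (stand-in for LLM output) vs. a hash-map solution
# (stand-in for the canonical LeetCode solution).
def two_sum_naive(nums, target):
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == target:
                return [i, j]

def two_sum_canonical(nums, target):
    seen = {}
    for i, x in enumerate(nums):
        if target - x in seen:
            return [seen[target - x], i]
        seen[x] = i

nums = list(range(20000))
target = 2 * len(nums) - 3  # worst case for the naive version
print(efficiency_ratios(profile_solution(two_sum_naive, nums, target),
                        profile_solution(two_sum_canonical, nums, target)))
```

In this toy setup, ratios greater than 1.0 indicate the candidate is less efficient than the canonical solution on that metric, which is the direction of the findings reported for GPT-4 and the other evaluated models.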