On Evaluating the Efficiency of Source Code Generated by LLMs

April 14, 2024 | Changan Niu, Ting Zhang, Chuanyi Li, Bin Luo, Vincent Ng
This paper evaluates the efficiency of source code generated by large language models (LLMs), a property often overlooked by prior work, which primarily evaluates correctness. The authors measure the efficiency of LLM-generated code on two standard benchmarks, HumanEval and MBPP, and then on a more challenging benchmark built from LeetCode problems. They also explore prompts that can help LLMs generate more efficient code.

The study finds that the ability to generate correct code is not necessarily correlated with the ability to generate efficient code. For example, GPT-4 achieves a higher Pass@10 score than GPT-3.5, yet the code it generates is less efficient than GPT-3.5's. Likewise, a larger parameter count does not always yield better performance: across Code Llama and WizardCoder models of different sizes, increasing the number of parameters does not significantly change the runtime of the generated code. Training strategy and training data also affect efficiency; for instance, DeepSeek Coder 33B Instruct outperforms its base version.

The study further tests three prompting strategies for improving code efficiency, with the best results achieved by a chain-of-thought prompt. Step-by-step prompting leads to more efficient code, especially on complex problems. The authors conclude that the efficiency of LLM-generated code is largely independent of both a model's ability to generate correct code and its size. Future work will focus on proposing a novel prompting method to further enhance the efficiency of LLM-generated code.
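The Pass@10 score mentioned above is an instance of the Pass@k correctness metric commonly used for code-generation benchmarks. As a minimal sketch (assuming the standard unbiased estimator, which the summary does not spell out), Pass@k can be computed from n generated samples of which c pass all unit tests:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator.

    n -- total number of samples generated per problem
    c -- number of those samples that pass all unit tests
    k -- budget of samples the metric considers

    Returns the probability that at least one of k randomly drawn
    samples (out of the n generated) is correct.
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # samples must contain at least one correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per problem, 5 correct, budget of 1 draw.
score = pass_at_k(10, 5, 1)  # 1 - C(5,1)/C(10,1) = 0.5
```

Note that Pass@k only rewards correctness; two models with identical Pass@10 can still differ widely in the runtime of the code they emit, which is exactly the gap this paper studies.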
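The paper's exact measurement setup is not described in this summary, but the idea of comparing the runtime efficiency of two generated solutions can be sketched with a simple wall-clock timing harness (the function name and repeat count below are illustrative, not the authors' protocol):

```python
import statistics
import time

def mean_runtime(func, args=(), repeats=5):
    """Run func(*args) several times and return the mean wall-clock
    time in seconds. Repeating and averaging smooths out scheduler
    noise; real benchmarks would also pin warm-up runs and inputs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)

# Example: time one candidate solution on a fixed test input.
data = list(range(10_000, 0, -1))
t_sorted = mean_runtime(sorted, (data,))
```

On a benchmark like LeetCode, the same harness would be applied to each model's accepted solution on identical inputs, so that runtime differences reflect the generated algorithm rather than the test data.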