2024 | Ruizhong Qiu, Weiliang Will Zeng, Hanghang Tong, James Ezick, Christopher Lott
The paper introduces ENAMEL, a rigorous and high-standard benchmark for evaluating the efficiency of code generated by large language models (LLMs). It proposes a new efficiency metric, eff@k, which generalizes the pass@k metric from correctness to efficiency. The metric accounts for right-censored execution times (runs cut off at the time limit), and the paper provides an unbiased, variance-reduced estimator for it via Rao–Blackwellization. The benchmark includes expert-designed reference solutions and strong test case generators to ensure a fair and comprehensive evaluation. An extensive study across 30 popular LLMs shows that current LLMs still fall short of generating expert-level efficient code: they struggle to design advanced algorithms and show little awareness of implementation-level optimization. The benchmark is publicly available at https://github.com/q-rz/enamel.
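To make the metric concrete, here is a minimal sketch of a Rao–Blackwellized estimator in the style of eff@k. It assumes (based on the abstract, not the paper's exact formulas) that eff@k is the expected best efficiency score among a uniformly random size-k subset of the n generated samples; averaging over the exact subset distribution, instead of drawing subsets, is what removes the sampling variance. The function name and interface are illustrative, not the benchmark's actual API.

```python
from math import comb

def eff_at_k(scores, k):
    """Hypothetical Rao-Blackwellized estimator for an eff@k-style metric.

    scores: per-sample efficiency scores e_1..e_n for one problem
            (incorrect samples score 0; timed-out runs are right-censored,
            so their score must be derived from the time limit, not the
            unobserved true runtime).
    k:      subset size, k <= n.
    """
    n = len(scores)
    if k > n:
        raise ValueError("need at least k samples")
    e = sorted(scores)  # order statistics e_(1) <= ... <= e_(n)
    # P(max of a uniform random k-subset is the j-th order statistic)
    # = C(j-1, k-1) / C(n, k); summing over this exact distribution
    # replaces Monte-Carlo subset sampling (Rao-Blackwellization).
    return sum(comb(j - 1, k - 1) * e[j - 1]
               for j in range(k, n + 1)) / comb(n, k)
```

With binary scores (1 if correct, else 0) this reduces to the usual pass@k estimator: e.g. one correct sample out of two gives eff@1 = 0.5, the chance a random single sample passes.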