16 Jun 2024 | Ruizhong Qiu†, Weiliang Will Zeng‡, Hanghang Tong†, James Ezick‡, Christopher Lott‡
The paper introduces ENAMEL (EfficieNcy AutoMatic evaLuator), a rigorous and high-standard benchmark for evaluating the efficiency of code generated by large language models (LLMs). The benchmark addresses several challenges that existing evaluations overlook: right-censored execution time, the effect of sample size on efficiency, algorithm design versus implementation optimization, correctness, and worst-case efficiency. ENAMEL proposes a new efficiency metric, eff@k, which generalizes the pass@k metric from correctness to efficiency and properly handles right-censored execution time (executions cut off at the time limit). It also provides an unbiased, variance-reduced estimator of eff@k obtained through Rao-Blackwellization. To set a high standard, the benchmark includes expert-written reference solutions and strong test case generators curated by a human expert. An extensive evaluation of 30 popular LLMs shows that they still fall short of generating expert-level efficient code, particularly in designing advanced algorithms and implementing optimizations. The benchmark is publicly available at <https://github.com/q-rz/enameL>.
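To make the eff@k idea concrete, here is a minimal Python sketch, assuming eff@k is the expected best efficiency score among k samples drawn uniformly without replacement from the n generated samples, and that its Rao-Blackwellized estimator can be written in closed form over order statistics. The per-sample scores and the treatment of incorrect samples below are illustrative assumptions, not the paper's exact definitions; pass@k is included only for comparison.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased pass@k estimator: the probability that at least one
    of k samples drawn without replacement from n generated samples
    (c of which are correct) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def eff_at_k(scores: list[float], k: int) -> float:
    """Rao-Blackwellized estimator of an eff@k-style metric: the expected
    *best* efficiency score among k samples drawn uniformly without
    replacement from the n generated samples.

    scores: per-sample efficiency scores in [0, 1]; incorrect samples are
            assumed here to contribute a score of 0 (an illustrative
            assumption, not necessarily the paper's definition).
    """
    n = len(scores)
    assert 1 <= k <= n
    ordered = sorted(scores)  # ordered[i-1] is the i-th smallest score
    # The maximum of a uniformly random k-subset equals the i-th order
    # statistic with probability C(i-1, k-1) / C(n, k).
    return sum(
        comb(i - 1, k - 1) / comb(n, k) * ordered[i - 1]
        for i in range(k, n + 1)
    )

if __name__ == "__main__":
    # Toy example: 6 generated samples, 4 correct, hypothetical efficiency scores.
    scores = [0.0, 0.0, 0.31, 0.46, 0.58, 0.92]
    print(pass_at_k(n=6, c=4, k=2))  # correctness: pass@2
    print(eff_at_k(scores, k=2))     # efficiency: eff@2 (sketch)
```

Because the weights C(i-1, k-1) / C(n, k) sum to one, this closed form averages over all C(n, k) possible k-subsets analytically rather than by resampling, which is where the variance reduction comes from.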