PythonSaga: Redefining the Benchmark to Evaluate Code Generating LLMs

4 Jul 2024 | Ankit Yadav, Himanshu Beniwal, Mayank Singh
This paper introduces PythonSaga, a new benchmark for evaluating code generation models that addresses the limitations of existing benchmarks in terms of diversity of programming concepts and difficulty levels. Existing benchmarks such as HumanEval and MBPP show a significant bias toward a limited set of programming concepts and predominantly focus on easy tasks, which may overestimate the performance of code generation models. PythonSaga features 185 hand-crafted prompts, evenly distributed across 38 programming concepts at three difficulty levels, and is designed to provide a more balanced and comprehensive evaluation. The paper also presents a detailed analysis of the performance of various open- and closed-source code generation models on PythonSaga, revealing that most models struggle with advanced programming concepts and often produce syntactically incorrect or incomplete code. The study highlights the need for improved benchmarks and evaluation methods that accurately reflect the capabilities and limitations of code generation models, and its findings suggest that existing benchmarks may overestimate model performance. PythonSaga provides a more robust assessment framework for evaluating code generation models, paving the way for future research in this area.
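To make the evaluation setup concrete, the sketch below shows one way a concept- and difficulty-tagged benchmark entry could be consumed by a functional-correctness harness in the HumanEval style. The entry schema, field names (`concept`, `difficulty`, `prompt`, `tests`), and the example prompt are assumptions made for illustration only, not the paper's released format or code.

```python
# Minimal sketch of a functional-correctness harness for a PythonSaga-style
# benchmark entry. The schema and field names are assumptions for illustration.

# A hypothetical benchmark entry: a natural-language prompt tagged with a
# programming concept and difficulty level, plus hidden unit tests.
entry = {
    "concept": "stacks",
    "difficulty": "easy",
    "prompt": "Write a function is_balanced(s) that returns True if the "
              "brackets in s are balanced and False otherwise.",
    "tests": [
        ("()[]{}", True),
        ("(]", False),
        ("([{}])", True),
        ("(((", False),
    ],
}

# Stand-in for a model-generated completion; in a real run this string would
# come from the LLM being evaluated.
generated_code = """
def is_balanced(s):
    pairs = {')': '(', ']': '[', '}': '{'}
    stack = []
    for ch in s:
        if ch in '([{':
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack
"""


def passes_all_tests(code: str, tests) -> bool:
    """Execute the candidate code and check it against the hidden tests.

    Returns False if the code raises, fails to define the expected function,
    or returns a wrong answer on any test case.
    """
    namespace = {}
    try:
        exec(code, namespace)  # NOTE: real harnesses must sandbox untrusted code
        fn = namespace["is_balanced"]
        return all(fn(inp) == expected for inp, expected in tests)
    except Exception:
        return False


if __name__ == "__main__":
    ok = passes_all_tests(generated_code, entry["tests"])
    print(f"[{entry['concept']}/{entry['difficulty']}] passed: {ok}")
```

In a full evaluation, one would typically sample many completions per prompt, aggregate with a pass@k-style metric, and break results down by concept and difficulty level to surface the kinds of gaps the paper reports.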