The RealHumanEval: Evaluating Large Language Models' Abilities to Support Programmers


3 Apr 2024 | Hussein Mozannar*1,2, Valerie Chen*3, Mohammed Alsobay2, Subhro Das1,4, Sebastian Zhao5, Dennis Wei1,4, Manish Nagireddy1,4, Prasanna Sattigeri1,4, Ameet Talwalkar3, and David Sontag1,2
The RealHumanEval study evaluates large language models (LLMs) on their ability to assist programmers. It introduces a web-based platform that measures how well LLMs support programmers through either autocomplete suggestions or chat interactions. In a user study with 213 participants, each participant solved coding tasks while assisted by one of six LLMs of varying benchmark performance.

Improvements in benchmark performance generally translated into gains in programmer productivity, but the gaps between models on static benchmarks were not proportional to the gaps observed with human programmers. Better LLM support reduced task completion time but did not necessarily increase the number of tasks completed. Human preference signals, such as the rate at which autocomplete suggestions were accepted or code was copied from chat responses, did not always correlate with actual programmer performance, and participants perceived chat support as more helpful than autocomplete support.

These results suggest that static benchmarks do not fully capture the practical impact of LLMs on programmer productivity, and they highlight the need for human-centric evaluation and for better proxy signals of downstream value. RealHumanEval and the collected study data are open-sourced to enable human-centric evaluation of new models and to support research on improving code models. The findings motivate further work on understanding how programmers interact with LLMs and on building models that better support programming tasks.
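To make the distinction between preference signals and productivity concrete, the sketch below shows how per-participant proxy metrics (autocomplete acceptance rate, chat copy events) could be computed from interaction logs and correlated with a productivity outcome such as tasks completed. This is a minimal illustration, not the authors' analysis code: the `SessionLog` schema and field names are assumptions, and `statistics.correlation` (Python 3.10+) is used as a simple Pearson correlation.

```python
from dataclasses import dataclass
from statistics import correlation  # Python 3.10+: Pearson correlation


# Hypothetical log records; field names are illustrative, not the RealHumanEval schema.
@dataclass
class SessionLog:
    participant_id: str
    suggestions_shown: int      # autocomplete suggestions displayed
    suggestions_accepted: int   # autocomplete suggestions accepted
    chat_copy_events: int       # times code was copied from a chat response
    tasks_completed: int        # productivity outcome


def acceptance_rate(log: SessionLog) -> float:
    """Proxy preference signal: fraction of shown suggestions that were accepted."""
    return log.suggestions_accepted / log.suggestions_shown if log.suggestions_shown else 0.0


def proxy_vs_outcome(logs: list[SessionLog]) -> dict[str, float]:
    """Correlate proxy preference signals with a productivity outcome.

    A weak correlation here would mirror the study's observation that
    preference metrics need not track actual programmer performance.
    """
    rates = [acceptance_rate(log) for log in logs]
    copies = [float(log.chat_copy_events) for log in logs]
    completed = [float(log.tasks_completed) for log in logs]
    return {
        "acceptance_vs_tasks": correlation(rates, completed),
        "copies_vs_tasks": correlation(copies, completed),
    }


if __name__ == "__main__":
    demo = [
        SessionLog("p1", 40, 22, 3, 5),
        SessionLog("p2", 35, 10, 7, 6),
        SessionLog("p3", 50, 30, 1, 4),
    ]
    print(proxy_vs_outcome(demo))
```

In this framing, a high acceptance or copy rate indicates that programmers liked the suggestions, while tasks completed reflects realized productivity; comparing the two is one way to test whether a preference metric is a reliable proxy for downstream value.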