3 Apr 2024 | Hussein Mozannar*, Valerie Chen*, Mohammed Alsobay, Subhro Das, Sebastian Zhao, Dennis Wei, Manish Nagireddy, Prasanna Sattigeri, Ameet Talwalkar, and David Sontag
The paper "RealHumanEval: Evaluating Large Language Models’ Abilities to Support Programmers" by Hussein Mozannar et al. introduces RealHumanEval, a web-based platform designed to evaluate the effectiveness of large language models (LLMs) in assisting programmers. The platform supports two forms of LLM assistance: autocomplete and chat support, and it logs user behavior to measure productivity metrics such as task completion time and acceptance rates of suggestions.
The authors conducted a user study with 213 participants, who interacted with six LLMs of varying performance levels through RealHumanEval. The results show that improvements in LLM performance on static benchmarks such as HumanEval do translate into productivity gains for programmers, most visibly in reduced time spent on coding tasks. However, gains in benchmark performance do not yield proportional gains in human productivity, suggesting that benchmark scores alone are an incomplete proxy for an LLM's practical utility.
The study also investigates whether human preference signals, such as the rate at which autocomplete suggestions are accepted or code from chat responses is copied, correlate with actual programmer performance. These preference metrics do not align with measured performance, suggesting that they are influenced by factors other than downstream utility. The findings highlight the need for better, human-centric proxy signals for evaluating LLMs.
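The comparisons in the two paragraphs above, between proxy signals (benchmark scores, acceptance rates, copy rates) and measured productivity, amount to a rank-correlation analysis across models. Below is a hedged sketch of that idea using SciPy's Spearman correlation; the function name, the per-model dictionaries, and their contents are illustrative assumptions, not the paper's analysis code.

```python
# Hedged sketch: rank-correlate a proxy metric (e.g. benchmark pass rate,
# acceptance rate, or copy rate) with a measured productivity metric
# (e.g. tasks completed per unit time), each aggregated per model.
from scipy.stats import spearmanr


def proxy_vs_productivity(proxy_by_model: dict[str, float],
                          productivity_by_model: dict[str, float]) -> tuple[float, float]:
    """Spearman rank correlation between a proxy metric and measured productivity.

    Both dicts map model name -> metric value; only models present in both
    are compared. Returns (correlation, p_value).
    """
    models = sorted(proxy_by_model.keys() & productivity_by_model.keys())
    proxy = [proxy_by_model[m] for m in models]
    productivity = [productivity_by_model[m] for m in models]
    rho, p_value = spearmanr(proxy, productivity)
    return rho, p_value
```

In this framing, the paper's findings would correspond to a positive but imperfect correlation for benchmark scores and a weak or absent correlation for acceptance and copy rates.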
The paper concludes by recommending the use of RealHumanEval to bridge the gap between offline and human evaluations, and suggests future work on improving LLMs to better support programmers, particularly in inferring context and providing more tailored assistance.