Language Models for Code Completion: A Practical Evaluation

April 14–20, 2024 | Maliheh Izadi, Jonathan Katzy, Tim van Dam, Marc Otten, Razvan Mihai Popescu, Arie van Deursen
This study evaluates three public code language models (InCoder, UniXcoder, and CodeGPT) for code completion in real-world scenarios. The researchers developed an open-source IDE extension called Code4Me to collect real-world auto-completion data from over 1,200 users, resulting in more than 600,000 valid completions. The models were evaluated across twelve programming languages using six standard metrics, and a qualitative analysis of 1,690 real-world completion requests identified the reasons behind poor model performance. The study also compared the models' performance in online and offline settings using synthetic benchmark datasets and two masking strategies.

The findings suggest that while developers use code completion across many languages, the best results are achieved for mainstream languages such as Python and Java. InCoder outperformed the other models across all languages, highlighting the importance of training data and objectives. The study also found that offline evaluations do not accurately reflect real-world scenarios. The qualitative analysis revealed that 66.3% of failures were due to the models' limitations, 24.4% occurred because the models were used inappropriately in a development context, and 9.3% were valid requests that developers overwrote.

Based on these findings, the researchers propose strategies to overcome current limitations, including refining training objectives, improving resilience to typographical errors, adopting hybrid approaches, and enhancing implementations and usability. The study contributes the open-source IDE extension (Code4Me); quantitative online and offline evaluations across twelve programming languages; an analysis of the models' limitations based on 1,690 completions, resulting in a taxonomy of 18 causes of poor performance; and the public release of the source code, the offline-evaluation dataset, and the open-coding data from the qualitative analysis. The study also highlights the importance of considering real-world contexts when evaluating code completion models and suggests future research directions to improve their effectiveness.
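To make the online setting concrete, the sketch below shows one plausible way a client such as Code4Me could obtain a single-line completion from one of the evaluated models. It queries the public facebook/incoder-1B checkpoint through the Hugging Face transformers library; the checkpoint choice, prompt, and decoding settings are illustrative assumptions, not the study's actual serving configuration.

# Hedged sketch (assumption, not the paper's implementation): requesting a
# single-line completion from the public facebook/incoder-1B checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/incoder-1B")
model = AutoModelForCausalLM.from_pretrained("facebook/incoder-1B")

# Left context up to the cursor position where the developer triggered completion.
prefix = "def mean(values):\n    total = sum(values)\n    return "

inputs = tokenizer(prefix, return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=16, do_sample=False)

# Keep only the newly generated tokens and cut at the first newline,
# mirroring the single-line suggestion shown in an IDE.
new_tokens = output[0][inputs["input_ids"].shape[1]:]
completion = tokenizer.decode(new_tokens, skip_special_tokens=True).split("\n")[0]
print(completion)

A production extension would also trim the left context to the model's window and post-process the raw output before displaying it; those steps are omitted here for brevity.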
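For the offline comparison, predicted completions are typically scored against the code the developer actually wrote using metrics such as exact match and edit similarity. The snippet below is a minimal, assumed illustration of two such metrics using only the Python standard library (difflib's Ratcliff/Obershelp ratio as a stand-in for an edit-distance-based similarity); it is not the study's evaluation harness, and the six metrics reported in the paper may be computed differently.

# Minimal sketch (assumed, standard-library only) of two metrics commonly
# reported for code completion: exact match and character-level similarity.
from difflib import SequenceMatcher

def exact_match(prediction: str, ground_truth: str) -> bool:
    # Correct only if the suggestion matches the developer's accepted code
    # exactly, ignoring surrounding whitespace.
    return prediction.strip() == ground_truth.strip()

def edit_similarity(prediction: str, ground_truth: str) -> float:
    # Similarity in [0, 1]; 1.0 means the strings are identical.
    return SequenceMatcher(None, prediction, ground_truth).ratio()

ground_truth = "return [x ** 2 for x in values]"
for prediction in ("return [x ** 2 for x in values]", "return [x * x for x in vals]"):
    print(exact_match(prediction, ground_truth),
          round(edit_similarity(prediction, ground_truth), 2))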