April 14–20, 2024, Lisbon, Portugal | Maliheh Izadi, Jonathan Katzy, Tim van Dam, Marc Otten, Razvan Mihai Popescu, Arie van Deursen
This study evaluates the performance of three public code language models (InCoder, UniXcoder, and CodeGPT) using real-world code completion data. The authors developed an open-source IDE extension, *Code4Me*, to collect over 600,000 valid completions from more than 1200 users over a year. The models were assessed using six standard metrics across twelve programming languages. A qualitative analysis of 1690 real-world completion requests identified 18 types of failures, with 66.3% due to model limitations, 24.4% due to inappropriate model usage, and 9.3% due to developers overwriting correct predictions. InCoder outperformed the other models, highlighting the importance of training data and objectives. The study also found that offline evaluations do not accurately reflect real-world scenarios, with a significant mismatch between synthetic and real-world data. The findings suggest that future research should focus on broadening the range of programming languages, refining training objectives, and improving model resilience and usability.
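The summary refers to "six standard metrics" without listing them. As an illustrative sketch only, two metrics commonly used in offline code-completion evaluation are exact match and Levenshtein-based edit similarity; the function names and example strings below are hypothetical and are not taken from the paper.

```python
# Illustrative sketch of two common offline code-completion metrics
# (exact match and edit similarity). These are assumptions for
# demonstration, not the paper's exact metric set.

def exact_match(prediction: str, ground_truth: str) -> float:
    """1.0 if the predicted completion equals the accepted code, else 0.0."""
    return 1.0 if prediction.strip() == ground_truth.strip() else 0.0


def edit_similarity(prediction: str, ground_truth: str) -> float:
    """Similarity in [0, 1] derived from the Levenshtein edit distance."""
    a, b = prediction.strip(), ground_truth.strip()
    if not a and not b:
        return 1.0
    # Classic dynamic-programming Levenshtein distance, row by row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))


if __name__ == "__main__":
    pred = "return sorted(items, key=len)"
    truth = "return sorted(items, key=len, reverse=True)"
    print(exact_match(pred, truth))                # 0.0: not identical
    print(round(edit_similarity(pred, truth), 3))  # partial credit for a near-miss
```

Such string-similarity metrics score a prediction against the code the developer eventually wrote, which is one reason offline scores can diverge from how useful a completion feels in a live IDE session.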