Calibration and Correctness of Language Models for Code

21 Aug 2024 | Claudio Spiess*, David Gros*, Kunal Suresh Pai, Michael Pradel, Md Rafiqul Islam Rabin, Amin Alipour, Susmit Jha, Prem Devanbu, Toufique Ahmed
This paper addresses the calibration of large language models (LLMs) used for generating code, a critical issue given the increasing integration of LLMs into software engineering practice. The authors evaluate the calibration of LLMs on tasks such as code completion, function synthesis, and program repair, using several correctness criteria and datasets. They find that, in general, LLMs are not well calibrated out of the box, with Expected Calibration Error (ECE) values ranging from 0.09 to 0.73. To improve calibration, they explore methods such as Platt scaling and reflective prompting. Platt scaling can substantially improve calibration, but it requires a sufficient amount of correctness data and can lead to "bucket collapse," where nearly all samples land in a single confidence bucket. Reflective prompting shows mixed results, with some models performing better without rescaling. The authors also investigate few-shot techniques, finding that BM25-aided few-shot learning can achieve a Skill Score of 0.15 for line completion, a significant improvement over the baseline. The paper concludes by discussing the importance of calibration in managing the risks associated with LLM-generated code and the potential for future research to further improve calibration methods.
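To make the two calibration concepts in the summary concrete, below is a minimal, self-contained sketch (not the authors' code) of how Expected Calibration Error can be computed from per-sample confidences and binary correctness labels, and how Platt scaling can rescale those confidences. The function names and the synthetic data are assumptions introduced purely for illustration.

```python
# Illustrative sketch only: ECE and Platt scaling over hypothetical
# per-sample confidences and binary correctness labels.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: average over equal-width confidence bins of the absolute gap
    between mean confidence and empirical accuracy, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences >= lo) & (confidences <= hi) if lo == 0 else \
               (confidences > lo) & (confidences <= hi)
        if mask.sum() == 0:
            continue  # empty bins contribute nothing
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += (mask.sum() / len(confidences)) * gap
    return ece

def platt_scale(confidences, correct, lr=0.1, epochs=2000):
    """Fit a, b in sigmoid(a * logit(conf) + b) via simple gradient descent
    on the logistic loss, so rescaled confidences track observed accuracy.
    Returns a function that maps raw confidences to rescaled ones."""
    eps = 1e-6
    c = np.clip(np.asarray(confidences, dtype=float), eps, 1 - eps)
    x = np.log(c / (1 - c))                 # logit of raw confidence
    y = np.asarray(correct, dtype=float)
    a, b = 1.0, 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(a * x + b)))
        a -= lr * np.mean((p - y) * x)      # gradient of logistic loss wrt a
        b -= lr * np.mean(p - y)            # gradient wrt b
    def rescale(conf):
        cc = np.clip(np.asarray(conf, dtype=float), eps, 1 - eps)
        return 1.0 / (1.0 + np.exp(-(a * np.log(cc / (1 - cc)) + b)))
    return rescale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical overconfident model: confidences ~0.7-0.99, true accuracy ~60%.
    conf = rng.uniform(0.7, 0.99, size=500)
    correct = rng.random(500) < 0.6
    print("ECE before rescaling:", round(expected_calibration_error(conf, correct), 3))
    rescale = platt_scale(conf, correct)
    print("ECE after Platt scaling:", round(expected_calibration_error(rescale(conf), correct), 3))
```

The demo at the bottom illustrates the paper's observation in miniature: an overconfident model has a large ECE, and fitting a Platt rescaler on held-out correctness data pulls the confidences back toward the observed accuracy. It also hints at why enough correctness data is needed, since the two fitted parameters are estimated from those labels.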