9 Mar 2024 | Dennis Ulmer, Martin Gubri, Hwaran Lee, Sangdo Yun, Seong Joon Oh
This paper introduces APRICOT, a method for calibrating large language models (LLMs) using only their generated text. APRICOT trains an auxiliary model to predict an LLM's confidence from the question and the LLM's answer alone. Calibration targets for the auxiliary model are created with clustering techniques, grouping semantically similar questions and using the LLM's accuracy within each cluster, so no access to the LLM's internal states or parameters is required. The approach is conceptually simple, does not interfere with the LLM's generation process, and applies to both white-box and black-box LLMs. Evaluated on closed-book question answering, APRICOT achieves competitive calibration error, reliably flags incorrect answers, and thereby improves the reliability of LLM outputs. The paper also discusses limitations of the approach, including the potential for distributional shift and the need for further research on calibrating pre-trained language models. Overall, APRICOT offers a practical way to calibrate LLMs and improve their reliability in real-world applications.
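The clustering-to-targets idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: TF-IDF embeddings, k-means, and ridge regression stand in for the sentence embeddings and fine-tuned auxiliary model APRICOT actually uses, and the toy data is invented.

```python
# Hedged sketch of APRICOT-style clustering-based calibration targets.
# Assumptions (not from the paper's code): TF-IDF features replace the
# paper's sentence embeddings; ridge regression replaces its fine-tuned
# auxiliary language model.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

# Toy data: (question + LLM answer) strings and whether each answer was correct.
texts = [
    "Q: capital of France? A: Paris",
    "Q: capital of Spain? A: Madrid",
    "Q: capital of Italy? A: Venice",
    "Q: 2 plus 2? A: 4",
    "Q: 3 plus 3? A: 7",
    "Q: 5 plus 5? A: 10",
]
correct = np.array([1, 1, 0, 1, 0, 1])

# 1) Embed the input/output text only (black-box: no LLM internals needed).
X = TfidfVectorizer().fit_transform(texts).toarray()

# 2) Cluster the embeddings; each example's calibration target is the
#    LLM's accuracy within its cluster.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
targets = np.array([correct[labels == c].mean() for c in labels])

# 3) Train an auxiliary model to regress these soft targets from the text
#    features, then clip its predictions into [0, 1] to read as confidences.
aux = Ridge(alpha=1.0).fit(X, targets)
confidences = np.clip(aux.predict(X), 0.0, 1.0)
```

At inference time only step 1 and the trained auxiliary model are needed, so confidence estimates come purely from the generated text, matching the black-box setting the summary describes.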