This paper proposes using multicalibration to generate interpretable and reliable confidence scores for outputs from large language models (LLMs). Multicalibration ensures calibration not just marginally but simultaneously across many intersecting groupings of the data. The authors develop techniques for forming such groupings over prompt/completion pairs that correlate with the probability of correctness, using clustering in an embedding space and self-annotation via yes-or-no questions posed to the LLM. They also introduce novel multicalibration algorithms that improve performance by reducing overfitting. Systematic benchmarking across a variety of question-answering datasets and LLMs shows that these techniques yield confidence scores that substantially improve on existing methods in both calibration and accuracy.
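To make the grouping construction concrete, the sketch below illustrates both strategies under assumed interfaces: `embed` (a sentence-embedding model) and `ask_yes_no` (a helper that queries the LLM with a yes-or-no question) are hypothetical placeholders rather than functions from the paper, and k-means is only one reasonable clustering choice.

```python
import numpy as np
from sklearn.cluster import KMeans

def embedding_cluster_groups(texts, embed, n_clusters=10, seed=0):
    """Cluster prompt/completion texts in embedding space; each cluster
    defines one group for multicalibration."""
    X = np.asarray(embed(texts))                      # (n, d) embedding matrix
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    return km.fit_predict(X)                          # cluster id per example

def self_annotation_groups(texts, ask_yes_no, questions):
    """Ask the LLM a fixed set of yes/no questions about each example;
    each question induces one binary grouping."""
    return np.array([[bool(ask_yes_no(t, q)) for q in questions] for t in texts])
```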
The paper discusses the challenge of hallucination detection in LLMs, which often fabricate information. Multicalibration is used to produce calibrated probabilities indicating whether a generated response is a hallucination. Unlike conventionally calibrated scores, multicalibrated probabilities remain calibrated conditionally on various properties of the instance, making them a more refined measure of risk. The authors show how to apply multicalibration to hallucination detection, including how to generate informative groupings through clustering and self-annotation.
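For intuition, a textbook-style multicalibration routine is sketched below: it repeatedly searches for a (group, prediction-level) cell whose empirical correctness rate deviates from the predicted score and patches the scores in that cell. This is a generic illustration of the multicalibration condition, not the paper's specific algorithm.

```python
import numpy as np

def multicalibrate(scores, labels, groups, n_bins=10, alpha=0.01, max_iter=1000):
    """Generic multicalibration patching.

    scores: initial confidence scores in [0, 1]
    labels: 1 if the completion was correct (not a hallucination), else 0
    groups: boolean matrix of shape (n_examples, n_groups)
    """
    f = np.asarray(scores, dtype=float).copy()
    y = np.asarray(labels, dtype=float)
    for _ in range(max_iter):
        updated = False
        bins = np.minimum((f * n_bins).astype(int), n_bins - 1)
        for g in range(groups.shape[1]):
            for b in range(n_bins):
                cell = groups[:, g] & (bins == b)
                if not cell.any():
                    continue
                gap = y[cell].mean() - f[cell].mean()
                if abs(gap) > alpha:                  # cell is miscalibrated: patch it
                    f[cell] = np.clip(f[cell] + gap, 0.0, 1.0)
                    updated = True
        if not updated:                               # approximately multicalibrated
            break
    return f
```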
The paper introduces a new algorithm, Iterative Grouped Linear Binning (IGLB), which improves upon existing multicalibration methods by reducing overfitting. The algorithm combines grouped binning strategies with linear scaling of the scores, and the authors pair it with early-stopping strategies that further guard against overfitting. They evaluate the effectiveness of these methods across a range of datasets and LLMs.
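The exact details of IGLB are not reproduced here; the following is only a hypothetical sketch of the general recipe this paragraph describes, i.e. grouped binning combined with linear scaling and early stopping on a held-out split. The function names and the Brier-score stopping criterion are our own assumptions, not the paper's implementation.

```python
import numpy as np

def fit_group_update(scores, labels, group_mask, n_bins=10):
    """Within one group, bin the scores and fit a linear map from each score
    to its bin's empirical accuracy (fit on the training split only)."""
    idx = np.where(group_mask)[0]
    if idx.size < n_bins:
        return None                                   # group too small to refit
    s, y = scores[idx], labels[idx]
    bins = np.minimum((s * n_bins).astype(int), n_bins - 1)
    targets = np.array([y[bins == bins[i]].mean() for i in range(idx.size)])
    slope, intercept = np.polyfit(s, targets, 1)      # linear scaling of binned scores
    return slope, intercept

def apply_group_update(scores, group_mask, update):
    out = scores.copy()
    if update is not None:
        slope, intercept = update
        out[group_mask] = np.clip(slope * scores[group_mask] + intercept, 0.0, 1.0)
    return out

def calibrate_with_early_stopping(tr_s, tr_y, tr_g, va_s, va_y, va_g, max_rounds=20):
    """Sweep over all groups each round; stop once the held-out Brier score
    stops improving, to avoid overfitting the calibration set."""
    brier = lambda f, y: float(np.mean((f - y) ** 2))
    best = brier(va_s, va_y)
    for _ in range(max_rounds):
        for g in range(tr_g.shape[1]):
            update = fit_group_update(tr_s, tr_y, tr_g[:, g])
            tr_s = apply_group_update(tr_s, tr_g[:, g], update)
            va_s = apply_group_update(va_s, va_g[:, g], update)
        current = brier(va_s, va_y)
        if current >= best:                           # early stopping
            break
        best = current
    return tr_s, va_s
```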
Experiments show that multicalibration methods, particularly IGLB and GCULR, significantly outperform standard calibration baselines in both calibration and accuracy. The results demonstrate that multicalibration improves the reliability of confidence scores, making them more trustworthy for applications built on LLMs. The paper concludes that multicalibration is a promising approach for enhancing the reliability and interpretability of confidence scores in LLMs.
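For reference, expected calibration error (ECE) is one standard way to quantify the calibration gains reported in these experiments; the paper's exact evaluation metrics may differ.

```python
import numpy as np

def expected_calibration_error(confidences, labels, n_bins=10):
    """Bin-weighted average gap between mean confidence and empirical accuracy."""
    conf = np.asarray(confidences, dtype=float)
    y = np.asarray(labels, dtype=float)
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(conf[mask].mean() - y[mask].mean())
    return ece
```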