Calibrating Large Language Models with Sample Consistency

21 Feb 2024 | Qing Lyu, Kumar Shridhar, Chaitanya Malaviya, Li Zhang, Yanai Elazar, Niket Tandon, Marianna Apidianaki, Mrinmaya Sachan, Chris Callison-Burch
This paper proposes a method to calibrate the confidence of Large Language Models (LLMs) by analyzing the consistency of multiple generations. The authors evaluate three consistency measures: agreement-based, entropy-based, and first-second-distance-based (FSD). These measures are applied to both open-source and closed-source LLMs across nine reasoning datasets. Results show that consistency-based calibration methods outperform existing post-hoc approaches.

Factors such as intermediate explanations, model scaling, and larger sample sizes enhance calibration, while instruction-tuning makes calibration more difficult. Confidence scores derived from consistency also have the potential to improve model performance. The authors provide practical guidance on selecting suitable consistency metrics based on the characteristics of different LLMs. They further find that explanation-based prompting strategies improve calibration, with FCoT (Faithful Chain of Thought) being particularly effective for GPT models. The study highlights the importance of consistency for model reliability and provides a framework for choosing appropriate calibration methods, suggesting that consistency-based calibration is a promising approach for improving the reliability of LLMs.
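To make the three consistency measures concrete, the sketch below shows one way such confidence scores could be computed from the final answers of several sampled generations for the same question. This is a minimal illustration, not the paper's reference implementation: the function name, the entropy normalization (by the log of the sample count), and the tie handling are assumptions made for clarity.

```python
from collections import Counter
import math

def consistency_confidences(answers):
    """Compute consistency-based confidence scores from sampled answers.

    `answers` is a list of final answers extracted from multiple sampled
    generations for one question. Returns (agreement, entropy_based, fsd),
    each in [0, 1]. This is an illustrative sketch, not the paper's code.
    """
    n = len(answers)
    counts = Counter(answers)
    freqs = sorted((c / n for c in counts.values()), reverse=True)

    # Agreement: fraction of samples matching the most frequent (majority) answer.
    agreement = freqs[0]

    # Entropy-based: 1 minus the normalized entropy of the answer distribution
    # (1.0 when all samples agree; lower when answers are spread out).
    # Normalizing by log(n) is an assumption for this sketch.
    if len(freqs) == 1:
        entropy_based = 1.0
    else:
        entropy = -sum(p * math.log(p) for p in freqs)
        entropy_based = 1.0 - entropy / math.log(n)

    # FSD: gap between the frequencies of the most common and the
    # second most common answer (first-second distance).
    second = freqs[1] if len(freqs) > 1 else 0.0
    fsd = freqs[0] - second

    return agreement, entropy_based, fsd

# Example: five sampled answers to the same question.
answers = ["42", "42", "41", "42", "7"]
agreement, entropy_conf, fsd = consistency_confidences(answers)
# agreement = 0.6, fsd = 0.4; entropy_conf is roughly 0.41.
```

Higher scores indicate that the samples agree more, which the paper uses as a proxy for the model's confidence in its majority answer.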