Calibrating Large Language Models with Sample Consistency

21 Feb 2024 | Qing Lyu, Kumar Shridhar, Chaitanya Malaviya, Li Zhang, Yanai Elazar, Niket Tandon, Marianna Apidianaki, Mrinmaya Sachan, Chris Callison-Burch
This paper explores the calibration of Large Language Models (LLMs) by leveraging the consistency of multiple randomly sampled model generations. The authors investigate three measures of consistency: agreement-based, entropy-based, and first-second-distance-based (FSD). They evaluate a range of open- and closed-source models across nine reasoning datasets. The results show that consistency-based calibration methods outperform existing post-hoc approaches. Intermediate explanations, model scaling, and larger sample sizes improve calibration, while instruction-tuning makes calibration more difficult. The paper also demonstrates that confidence scores obtained from consistency can improve model performance. Finally, it provides practical guidance on selecting a suitable consistency metric based on the characteristics of different LMs.
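To make the three consistency measures concrete, here is a minimal Python sketch that turns a list of sampled final answers into confidence scores. The summary above only names the measures, so the exact formulas here are assumptions: agreement is taken as the share of samples matching the majority answer, the entropy-based score as one minus the entropy of the answer distribution normalized by the log of the number of distinct answers, and FSD as the gap between the shares of the most and second-most frequent answers. The function name `consistency_scores` is hypothetical.

```python
import math
from collections import Counter


def consistency_scores(answers):
    """Return (majority_answer, scores) for a list of sampled final answers.

    The formulas below are assumed definitions for illustration, not
    necessarily the exact ones used in the paper.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")

    counts = Counter(answers)
    n = len(answers)
    ranked = counts.most_common()
    top_answer, top_count = ranked[0]
    second_count = ranked[1][1] if len(ranked) > 1 else 0

    # Agreement: fraction of samples that match the majority answer.
    agreement = top_count / n

    # Entropy-based: 1 minus normalized entropy of the answer distribution.
    # With a single distinct answer the distribution has zero entropy.
    if len(counts) == 1:
        entropy_conf = 1.0
    else:
        probs = [c / n for c in counts.values()]
        entropy = -sum(p * math.log(p) for p in probs)
        entropy_conf = 1.0 - entropy / math.log(len(counts))

    # FSD: share of the top answer minus share of the runner-up.
    fsd = (top_count - second_count) / n

    return top_answer, {
        "agreement": agreement,
        "entropy": entropy_conf,
        "fsd": fsd,
    }


if __name__ == "__main__":
    sampled = ["42"] * 7 + ["41"] * 2 + ["40"]
    print(consistency_scores(sampled))
```

Under these assumed definitions, ten samples split 7/2/1 give an agreement of 0.7 and an FSD of 0.5, while the entropy-based score also reflects how the remaining votes are spread across the other answers.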