Benchmarking LLMs via Uncertainty Quantification


2024 | Fanghua Ye, Mingming Yang, Jianhui Pang, Longyue Wang, Derek F. Wong, Emine Yilmaz, Shuming Shi, Zhaopeng Tu
This paper introduces a benchmarking approach for Large Language Models (LLMs) that integrates uncertainty quantification. The authors evaluate nine LLMs across five natural language processing tasks: question answering, reading comprehension, commonsense inference, dialogue response selection, and document summarization. Uncertainty is quantified with conformal prediction, a distribution-free and model-agnostic method that provides a statistically rigorous measure of uncertainty by generating prediction sets guaranteed to cover the true label with a specified probability. The results show that higher accuracy does not necessarily correlate with lower uncertainty, that larger LLMs may exhibit greater uncertainty than smaller ones, and that instruction-finetuning tends to increase uncertainty. The study highlights the importance of incorporating uncertainty into LLM evaluation, an aspect that current benchmarks often neglect. The authors also compare their approach with other uncertainty quantification methods and demonstrate its effectiveness in evaluating both open-source and closed-source LLMs, concluding that conformal prediction offers a robust and systematic way to assess LLM uncertainty. The implementation is available at https://github.com/smartyfh/LLM-Uncertainty-Bench.
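To illustrate how conformal prediction produces such prediction sets, the sketch below implements split conformal prediction with an LAC-style nonconformity score (one minus the softmax probability assigned to the true option) over multiple-choice outputs. This is not the authors' released code: the function name, the number of answer options, and the random data are purely illustrative assumptions, and the average prediction set size is used here as a simple proxy for uncertainty.

```python
import numpy as np

def conformal_prediction_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction with an LAC-style nonconformity score.

    cal_probs:  (n, K) softmax probabilities over K answer options (calibration set)
    cal_labels: (n,)   indices of the true options for the calibration instances
    test_probs: (m, K) softmax probabilities for the test set
    alpha:      target miscoverage rate (sets cover the true label w.p. >= 1 - alpha)
    """
    n = len(cal_labels)
    # Nonconformity score: 1 - probability assigned to the true option.
    cal_scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile of the calibration scores.
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    qhat = np.quantile(cal_scores, q_level, method="higher")
    # Keep every option whose nonconformity score stays below the threshold.
    return [np.where(1.0 - p <= qhat)[0] for p in test_probs]

# Illustrative usage with synthetic scores; real usage would take option
# probabilities produced by an LLM on multiple-choice-formatted tasks.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(6), size=2000)      # 6 hypothetical answer options
labels = rng.integers(0, 6, size=2000)
sets = conformal_prediction_sets(probs[:1000], labels[:1000], probs[1000:], alpha=0.1)
print("average prediction set size:", np.mean([len(s) for s in sets]))
```

Under this scheme, a larger average prediction set size signals higher uncertainty, which is how accuracy and uncertainty can diverge as reported above.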