Benchmarking LLMs via Uncertainty Quantification


31 Oct 2024 | Fanghua Ye¹·², Mingming Yang¹, Jianhui Pang¹·³, Longyue Wang¹·*, Derek F. Wong³, Emine Yilmaz², Shuming Shi¹, Zhaopeng Tu
The paper addresses the need for comprehensive evaluation methods for Large Language Models (LLMs) by introducing a benchmarking approach that integrates uncertainty quantification. The study evaluates nine LLMs across five representative natural language processing (NLP) tasks: question answering, reading comprehension, commonsense inference, dialogue response selection, and document summarization.

Key findings include:

1. **Uncertainty and Accuracy**: Higher-accuracy LLMs may exhibit lower certainty, suggesting that accuracy alone is insufficient for a holistic evaluation.
2. **Model Scale**: Larger-scale LLMs tend to display greater uncertainty than smaller ones, indicating that model size affects uncertainty.
3. **Instruction Finetuning**: Instruction finetuning tends to increase the uncertainty of LLMs, highlighting the importance of considering uncertainty during finetuning.

To quantify uncertainty, the paper employs conformal prediction, a distribution-free and model-agnostic method. Compared with other uncertainty quantification methods, such as entropy and maximal predicted probability, conformal prediction proves superior in terms of reliability and coverage rate. The study also extends its analysis to closed-source LLMs and free-form text generation, further validating the effectiveness of the proposed approach. Overall, the research underscores the importance of incorporating uncertainty into LLM evaluation to provide a more comprehensive assessment of model performance.
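To make the conformal prediction idea concrete, below is a minimal sketch of split conformal prediction applied to multiple-choice evaluation. It is not the paper's exact procedure: the function names, the LAC-style nonconformity score (one minus the softmax probability of the correct option), and the use of average prediction-set size as the uncertainty summary are illustrative assumptions.

```python
# A minimal sketch of split conformal prediction for multiple-choice LLM evaluation.
# Assumptions (not taken from the paper): option-level softmax scores are already
# available as NumPy arrays, and the nonconformity score is 1 - P(correct option).
import numpy as np

def conformal_prediction_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Build prediction sets with (1 - alpha) marginal coverage.

    cal_probs:  (n_cal, n_options) softmax scores on a held-out calibration split
    cal_labels: (n_cal,) indices of the correct options
    test_probs: (n_test, n_options) softmax scores on the evaluation split
    """
    n = len(cal_labels)
    # Nonconformity score: one minus the probability assigned to the true option.
    cal_scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile of the calibration scores.
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    qhat = np.quantile(cal_scores, min(q_level, 1.0), method="higher")
    # Include every option whose nonconformity score is within the threshold.
    return [np.where(1.0 - p <= qhat)[0] for p in test_probs]

def average_set_size(pred_sets):
    # Larger average sets indicate a less certain model at the same coverage level.
    return float(np.mean([len(s) for s in pred_sets]))
```

Under this sketch, two models can be compared at the same target coverage (e.g., 90%): the one producing larger prediction sets on average is the more uncertain, which is how uncertainty can diverge from accuracy in the paper's findings.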