5 Dec 2024 | Sanyam Kapoor*, Nate Gruver*, Manley Roberts, Katherine Collins, Arka Pal, Umang Bhatt, Adrian Weller, Samuel Dooley, Micah Goldblum, Andrew Gordon Wilson
Large language models (LLMs) often fail to represent uncertainty in their predictions accurately, which is critical for high-stakes applications. This paper investigates methods for improving LLM uncertainty estimation. The authors argue that prompting alone is insufficient for good calibration and instead propose fine-tuning on a small dataset of graded correct and incorrect answers to produce reliable uncertainty estimates. They show that fine-tuning on as few as 1,000 examples significantly improves calibration and generalization, and that LoRA (Low-Rank Adaptation) makes this effective with minimal computational overhead. The study also finds that fine-tuned models can serve as general-purpose uncertainty estimators, assessing not only their own predictions but also those of other models. A user study further demonstrates that uncertainty estimates can inform human-AI collaboration: users are more likely to trust the model's answers when shown calibrated confidence scores. The findings suggest that this fine-tuning is sample-efficient and robust to distribution shift, and that calibrated uncertainties improve human decision-making when working with LLMs. The paper highlights the importance of uncertainty estimation for reliable LLM use in practical applications.
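To make the recipe concrete, here is a minimal sketch of what such a calibration fine-tune could look like, assuming a HuggingFace Transformers + PEFT setup. The model name, prompt format, linear probe head, and `dataset` iterable are illustrative assumptions rather than the authors' exact implementation; the idea is simply to train cheap LoRA adapters (plus a small head) to predict, from a question-answer pair, the probability that the answer is correct.

```python
# Hypothetical sketch: LoRA fine-tuning a "correctness" probe on top of a
# causal LM, using binary correct/incorrect labels. Not the paper's exact
# recipe; model name, prompt, and head design are assumptions.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"  # assumption; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
base = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# LoRA keeps the fine-tune cheap: only low-rank adapters are trained,
# the base weights stay frozen.
lora = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
                  lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(base, lora)

# Small linear head mapping the final hidden state to a correctness logit.
head = nn.Linear(base.config.hidden_size, 1, dtype=torch.bfloat16)

def uncertainty_logit(question: str, answer: str) -> torch.Tensor:
    """Score a (question, answer) pair; sigmoid of the output is the
    estimated probability that the answer is correct."""
    text = f"Q: {question}\nA: {answer}\nIs the proposed answer correct?"
    inputs = tokenizer(text, return_tensors="pt")
    hidden = model(**inputs, output_hidden_states=True).hidden_states[-1]
    return head(hidden[:, -1, :]).squeeze(-1)  # logit from the last token

# Training loop sketch: ~1,000 (question, answer, label) triples suffice
# per the paper's sample-efficiency claim. `dataset` is assumed to exist.
opt = torch.optim.AdamW(list(model.parameters()) + list(head.parameters()), lr=1e-4)
loss_fn = nn.BCEWithLogitsLoss()
for question, answer, label in dataset:
    logit = uncertainty_logit(question, answer)
    loss = loss_fn(logit, torch.tensor([float(label)], dtype=logit.dtype))
    loss.backward()
    opt.step()
    opt.zero_grad()
```

At inference time, `torch.sigmoid(uncertainty_logit(q, a))` yields a confidence score that can be shown to users or thresholded for deferral; because the probe only consumes a question-answer pair, the same setup can in principle score answers produced by a different model, matching the paper's observation that one model can estimate uncertainties for others.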