Linguistic Calibration of Long-Form Generations


4 Jun 2024 | Neil Band, Xuechen Li, Tengyu Ma, Tatsunori Hashimoto
Linguistic calibration of long-form generations aims to improve the reliability of language models (LMs) by having them produce confidence statements that match the likelihood of their claims being correct. This matters because confident hallucinations can lead to poor downstream decisions.

The paper proposes a training framework that combines supervised finetuning with reinforcement learning to calibrate LM generations so that users reading them can form calibrated probabilistic forecasts. The key idea is to construct an objective in the space of reader forecasts, enabling end-to-end calibration of long-form text: decision-based reinforcement learning directly optimizes for downstream decision-making. Notably, linguistic calibration is achieved without human feedback, by scoring the forecasts of a surrogate reader with proper scoring rules.

Applied to Llama 2 7B, the framework yields significant improvements in calibration over strong factuality baselines, with comparable accuracy. The gains generalize across domains, including scientific and biomedical questions, and transfer to a held-out biography generation task, demonstrating effectiveness even in out-of-distribution settings. The framework is computationally efficient and generalizes well to new tasks, making it a promising approach for improving the reliability of LM-generated text.
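
The reward in the decision-based RL stage can be illustrated with a small sketch. The Python example below is a minimal, hypothetical illustration under stated assumptions (the `SurrogateReader` callable, the toy forecasts, and all names are stand-ins, not the paper's implementation): a surrogate reader turns a long-form generation into a probability forecast over candidate answers, and a proper scoring rule (here the log score) rewards generations whose induced forecasts place high, calibrated probability on the correct answer.

```python
import math
from typing import Callable, Dict

# Hypothetical surrogate reader: maps (question, generation) to a forecast,
# i.e., a probability distribution over candidate answers. In the paper this
# role is played by a separate LM; here it is just a stand-in callable.
SurrogateReader = Callable[[str, str], Dict[str, float]]


def log_score_reward(forecast: Dict[str, float], correct_answer: str,
                     eps: float = 1e-12) -> float:
    """Log scoring rule: the log-probability the forecast assigns to the
    correct answer. It is a proper scoring rule, so its expectation is
    maximized only by reporting calibrated probabilities."""
    p = forecast.get(correct_answer, 0.0)
    return math.log(max(p, eps))


def calibration_reward(question: str, generation: str, correct_answer: str,
                       reader: SurrogateReader) -> float:
    """Reward for one (question, generation) pair: the surrogate reader forms
    a forecast from the generation alone, and that forecast is scored against
    the ground-truth answer with the proper scoring rule above."""
    forecast = reader(question, generation)
    return log_score_reward(forecast, correct_answer)


# Toy usage with a dummy reader that returns a fixed forecast.
if __name__ == "__main__":
    dummy_reader = lambda q, g: {"1912": 0.7, "1905": 0.3}
    r = calibration_reward("When did the event happen?",
                           "I'm about 70% sure it was 1912.",
                           "1912", dummy_reader)
    print(f"reward = {r:.3f}")  # log(0.7) ≈ -0.357
```

Because the log score is strictly proper, expected reward is maximized only when the reader's forecast matches the true answer distribution, which is what pushes the policy toward generations whose stated confidence is calibrated rather than merely assertive.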