When to Trust LLMs: Aligning Confidence with Response Quality

9 Jun 2024 | Shuchang Tao, Liuyi Yao, Hanxing Ding, Yuexiang Xie, Qi Cao, Fei Sun, Jinyang Gao, Huawei Shen, Bolin Ding
The paper "When to Trust LLMs: Aligning Confidence with Response Quality" addresses the challenge of ensuring that large language models (LLMs) produce reliable and trustworthy responses, especially in safety-critical domains. Despite the success of LLMs in natural language generation, they often generate incorrect or nonsensical text, leading to concerns about their reliability. The authors propose a novel approach called CONfidence-Quality-ORDer-preserving alignment (CONQORD), which uses reinforcement learning to align the confidence expressed by the model with the quality of its responses. CONQORD introduces a dual-component reward function: a quality reward that assesses the quality of the response, and an order-preserving alignment reward that ensures that higher-quality responses are associated with higher confidence levels. This approach aims to prevent the model from generating low-quality responses with overly high confidence, which can mislead users. Experiments on four foundational models (Llama-2 7B, Zephyr 7B, Mistral 7B, and Llama-2 13B) across two datasets (NQ and TruthfulQA) demonstrate that CONQORD significantly improves the alignment between confidence and response quality without causing over-cautious behavior. The calibrated confidence provided by CONQORD can also be used to guide the retrieval of external knowledge, enhancing the reliability of the model's outputs. The paper highlights the importance of aligning confidence with response quality to ensure more transparent and reliable responses, making LLMs more trustworthy in critical applications.The paper "When to Trust LLMs: Aligning Confidence with Response Quality" addresses the challenge of ensuring that large language models (LLMs) produce reliable and trustworthy responses, especially in safety-critical domains. Despite the success of LLMs in natural language generation, they often generate incorrect or nonsensical text, leading to concerns about their reliability. The authors propose a novel approach called CONfidence-Quality-ORDer-preserving alignment (CONQORD), which uses reinforcement learning to align the confidence expressed by the model with the quality of its responses. CONQORD introduces a dual-component reward function: a quality reward that assesses the quality of the response, and an order-preserving alignment reward that ensures that higher-quality responses are associated with higher confidence levels. This approach aims to prevent the model from generating low-quality responses with overly high confidence, which can mislead users. Experiments on four foundational models (Llama-2 7B, Zephyr 7B, Mistral 7B, and Llama-2 13B) across two datasets (NQ and TruthfulQA) demonstrate that CONQORD significantly improves the alignment between confidence and response quality without causing over-cautious behavior. The calibrated confidence provided by CONQORD can also be used to guide the retrieval of external knowledge, enhancing the reliability of the model's outputs. The paper highlights the importance of aligning confidence with response quality to ensure more transparent and reliable responses, making LLMs more trustworthy in critical applications.