July 14–18, 2024 | Hengran Zhang, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Yixing Fan, Xueqi Cheng
Are Large Language Models Good at Utility Judgments?
This paper investigates whether large language models (LLMs) are capable of making utility judgments in open-domain question answering (QA). The study introduces a benchmarking procedure and a collection of candidate passages with varying characteristics to evaluate the utility judgment capabilities of five representative LLMs. The results show that LLMs can distinguish between relevance and utility, and that utility judgments may provide more valuable guidance than relevance judgments in identifying ground-truth evidence. The study also finds that LLM performance on utility judgments is influenced by factors such as the input form, the sequence of input passages, and additional requirements like chain-of-thought reasoning. To reduce the dependency of LLMs on the position of ground-truth evidence, the study proposes a k-sampling listwise approach, which improves subsequent answer generation. Overall, the findings suggest that utility judgments outperform relevance judgments in guiding answer generation for retrieval-augmented LLMs, underscore the importance of evaluating how well passages support question answering, and point to further research on the utility judgment capabilities of LLMs.
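The k-sampling listwise approach is only named in the abstract, not specified, so the following is a minimal sketch of one plausible reading: sample k listwise utility judgments over shuffled passage orders and aggregate the selections by majority vote. The `judge` callable is a hypothetical stand-in for an LLM call that returns the positions of the passages it deems useful; the majority-vote aggregation is an assumption, not the paper's stated rule.

```python
import random
from collections import Counter
from typing import Callable, List


def k_sampling_listwise_judgment(
    question: str,
    passages: List[str],
    judge: Callable[[str, List[str]], List[int]],  # hypothetical LLM wrapper
    k: int = 5,
    seed: int = 0,
) -> List[int]:
    """Aggregate k listwise utility judgments over shuffled passage orders.

    Each round presents the passages in a different order, so no passage
    systematically benefits from (or suffers from) its position in the prompt.
    Returns the original indices of passages selected in a majority of rounds.
    """
    rng = random.Random(seed)
    votes: Counter = Counter()

    for _ in range(k):
        # Shuffle the presentation order for this round.
        order = list(range(len(passages)))
        rng.shuffle(order)
        shuffled = [passages[i] for i in order]

        # Listwise utility judgment: positions of useful passages in `shuffled`.
        selected_positions = judge(question, shuffled)

        # Map shuffled positions back to the original passage indices.
        for pos in selected_positions:
            votes[order[pos]] += 1

    # Keep passages selected in a strict majority of the k samples (assumed rule).
    return sorted(i for i, v in votes.items() if v > k / 2)
```

A larger k trades additional LLM calls for lower positional bias in the aggregated judgment.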