Are Large Language Models Good at Utility Judgments?

July 14–18, 2024, Washington, DC, USA | Hengran Zhang, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Yixing Fan, Xueqi Cheng
This paper investigates the capabilities of large language models (LLMs) in utility evaluation for open-domain question answering (QA). The authors conduct a comprehensive study to assess whether LLMs can distinguish between relevance and utility, and how utility judgments affect their QA performance. They introduce a benchmarking procedure and a collection of candidate passages with different characteristics, facilitating experiments with five representative LLMs. The key findings include:

1. **Utility vs. Relevance**: LLMs can distinguish between utility and relevance, with utility judgments offering more valuable guidance than relevance judgments for identifying ground-truth evidence.
2. **Influence of Instruction Design**: The input form of passages (pointwise, pairwise, listwise) and the order in which the question and passages are presented significantly affect utility judgments.
3. **Impact on QA Performance**: Utility judgments improve answer generation compared to relevance judgments, with the listwise-set input form outperforming other forms on the MSMARCO-QA dataset.
4. **k-Sampling Approach**: To reduce LLMs' dependency on the position of ground-truth evidence, the authors propose a k-sampling listwise approach, which improves answer generation performance; a hedged sketch of this idea appears below.

The study contributes to a critical assessment of retrieval-augmented LLMs and provides insights into enhancing their utility judgment capabilities. The code and benchmark are available at <https://github.com/ict-bigdatalab/utility_judgments>.
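The summary above does not spell out how the k-sampling listwise approach combines judgments across samples, so the following is only a minimal sketch of the general idea: judge the same candidate passages listwise under k random orderings and aggregate the results so that no single position dominates. The `judge` callable, the majority-vote aggregation, and all parameter names are illustrative assumptions for this sketch, not the authors' implementation.

```python
import random
from collections import Counter
from typing import Callable, List


def k_sampling_listwise(
    question: str,
    passages: List[str],
    judge: Callable[[str, List[str]], List[int]],
    k: int = 5,
    seed: int = 0,
) -> List[int]:
    """Aggregate listwise utility judgments over k shuffled passage orders.

    `judge` is a stand-in for an LLM call that, given a question and a
    list of passages, returns the indices (into that list) of the passages
    it judges useful for answering the question.
    """
    rng = random.Random(seed)
    votes: Counter = Counter()
    for _ in range(k):
        order = list(range(len(passages)))
        rng.shuffle(order)                       # randomize passage positions
        shuffled = [passages[i] for i in order]
        picked = judge(question, shuffled)       # one listwise utility judgment
        for pos in picked:
            votes[order[pos]] += 1               # map back to original indices
    # keep passages selected in a majority of the k sampled orderings
    return sorted(i for i, v in votes.items() if v > k / 2)
```

In practice, `judge` would wrap a listwise prompt to one of the evaluated LLMs, and the passages that survive aggregation would then be passed to the answer-generation step.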