Adapting Standard Retrieval Benchmarks to Evaluate Generated Answers

9 Jan 2024 | Negar Arabzadeh, Amin Bigdeli, and Charles L. A. Clarke
The paper addresses the challenge of evaluating the quality and correctness of answers generated by large language models (LLMs) in question-answering tasks, particularly in the absence of established evaluation methods. The authors propose an evaluation framework that leverages existing retrieval benchmarks to assess generated answers. They explore two main approaches:

1. **Evaluating Generated Answers Using Relevance Judgments**: The similarity between the embedded representations of generated answers and judged relevant passages from retrieval benchmarks is measured. This approach correlates highly with widely used evaluation metrics such as nDCG and addresses the problem of comparing generated and retrieved answers in the same space.

2. **Evaluating Generated Answers Without Relevance Judgments**: The similarity between generated answers and the top-retrieved passages from various retrieval models is measured. Even without explicit relevance judgments, a reliable IR pipeline can be used to assess the quality of generated answers, showing that retrieval benchmarks can serve as a reliable anchor for evaluating generative question answering.

Experiments on the MS MARCO dev set, TREC Deep Learning 2019, and TREC Deep Learning 2020 show that generative models such as gpt-4 and gpt-3.5-turbo perform comparably to the best retrieval-based runs on these datasets when measured by similarity to relevance-judged documents. The findings contribute to the field of generative question answering by providing a robust evaluation framework and by highlighting the potential of retrieval benchmarks as valuable tools for assessing LLMs.
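At the core of both approaches is a single operation: embed the generated answer and a set of reference passages, then score the answer by its similarity to those passages. The sketch below illustrates this idea with the sentence-transformers library; the embedding model (`all-MiniLM-L6-v2`), the max-similarity aggregation, and the function name `score_generated_answers` are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of similarity-based evaluation of generated answers,
# assuming a sentence-transformers embedding model. The reference passages
# are judged-relevant passages when relevance judgments exist, or
# top-retrieved passages from an IR pipeline otherwise.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model


def score_generated_answers(generated, reference_passages):
    """Score generated answers against reference passages.

    generated:          dict mapping query_id -> generated answer string
    reference_passages: dict mapping query_id -> list of passage strings
    Returns the mean over queries of each query's maximum cosine similarity
    between the generated answer and its reference passages.
    """
    per_query_scores = []
    for qid, answer in generated.items():
        passages = reference_passages.get(qid, [])
        if not passages:
            continue
        ans_emb = model.encode(answer, convert_to_tensor=True)
        psg_emb = model.encode(passages, convert_to_tensor=True)
        sims = util.cos_sim(ans_emb, psg_emb)  # shape: (1, num_passages)
        per_query_scores.append(sims.max().item())
    return sum(per_query_scores) / len(per_query_scores) if per_query_scores else 0.0
```

Under this framing, a higher average similarity indicates that the generated answers lie closer, in embedding space, to passages known (or retrieved) to be relevant, which is what allows the generative runs to be compared against retrieval-based runs on the same benchmarks.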