9 Jan 2024 | Negar Arabzadeh, Amin Bigdeli, and Charles L. A. Clarke
This paper introduces a method for evaluating the quality of answers generated by large language models (LLMs) using standard retrieval benchmarks. The authors propose two approaches: one based on the relevance judgments that accompany retrieval benchmarks, and another that compares generated answers to the top-retrieved passages from various retrieval models. In both cases, a generated answer is scored by the similarity between its embedding and the embedded representations of the reference passages.

The approach is evaluated on the MS MARCO, TREC Deep Learning 2019, and TREC Deep Learning 2020 datasets, using a variety of LLMs, including GPT-based and open-source models. The authors also test "liar" prompts, which are designed to elicit incorrect answers, to assess the models' ability to produce deceptive responses.

The results show that an IR benchmark can serve as a reliable anchor for evaluating generated answers, and that similarity between generated answers and judged-relevant passages can be used to measure answer quality. The authors also find that even without human judgments, a reliable retrieval pipeline can assess the quality of generated answers. In the experiments, generative models such as GPT-4 and GPT-3.5-turbo perform comparably to the best retrieval-based runs in TREC DL 2019 and TREC DL 2020 when measured by similarity to relevance-judged documents. The study highlights the potential of retrieval benchmarks as a valuable tool for evaluating LLM performance on generative question answering tasks.
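To make the scoring step concrete, here is a minimal sketch of what such a similarity-based evaluation could look like. This is an illustration under stated assumptions, not the authors' exact pipeline: the sentence-transformers model ("all-MiniLM-L6-v2"), the max-similarity aggregation, and the `answer_score` helper are all choices made for the example. The second approach described above would use the same machinery, substituting top-retrieved passages for the judged-relevant ones.

```python
# Minimal sketch: score a generated answer by its embedding similarity to
# passages judged relevant in a retrieval benchmark. The embedding model
# and the max-similarity aggregation are assumptions for illustration.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def answer_score(generated_answer: str, reference_passages: list[str]) -> float:
    """Return the maximum cosine similarity between the generated answer
    and the reference passages (judged-relevant or top-retrieved)."""
    answer_emb = model.encode(generated_answer, convert_to_tensor=True)
    passage_embs = model.encode(reference_passages, convert_to_tensor=True)
    sims = util.cos_sim(answer_emb, passage_embs)  # shape: (1, num_passages)
    return sims.max().item()

# Hypothetical example: a benchmark query with its judged-relevant passages.
judged_relevant = [
    "Paris is the capital and most populous city of France.",
    "France's capital city, Paris, lies on the Seine river.",
]
print(answer_score("The capital of France is Paris.", judged_relevant))
```

Averaging the per-query scores over a benchmark's topic set would then give a single number per LLM that can be compared against the retrieval runs submitted to the same benchmark.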