2024-07-19 | Rujun Han, Yuhao Zhang, Peng Qi, Yumo Xu, Jenyuan Wang, Lan Liu, William Yang Wang, Bonan Min, Vittorio Castelli
The paper introduces Long-form RobustQA (LFRQA), a new dataset designed to evaluate retrieval-augmented generative question answering (RAG-QA) systems on cross-domain generalization. LFRQA consists of 26K queries and covers seven different domains, with human-written long-form answers that integrate short extractive answers from multiple documents into coherent narratives. To address the limitations of existing datasets, which often use single-source corpora or short extractive answers, LFRQA provides a more comprehensive and realistic evaluation scenario.
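To make the dataset description concrete, here is a minimal sketch of what a single LFRQA record might look like as a Python data structure. The class and field names are illustrative assumptions, not the dataset's official schema; they only mirror the components the summary mentions (query, domain, multi-document extractive answers, and a coherent long-form answer).

```python
from dataclasses import dataclass


@dataclass
class LFRQAExample:
    """Illustrative LFRQA record (field names are assumptions, not the official schema)."""
    query: str               # user question
    domain: str               # one of the seven domains, e.g. "biomedical"
    short_answers: list[str]  # extractive answers drawn from multiple documents
    source_doc_ids: list[str]  # documents the short answers were extracted from
    long_form_answer: str     # human-written answer that weaves the extractive
                              # answers into a single coherent narrative


example = LFRQAExample(
    query="What are common side effects of drug X?",
    domain="biomedical",
    short_answers=["nausea", "headache"],
    source_doc_ids=["doc_102", "doc_487"],
    long_form_answer=(
        "Drug X most commonly causes nausea and headache; both effects are "
        "reported across multiple clinical sources."
    ),
)
```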
The authors propose RAG-QA ARENA, an evaluation framework that directly compares model-generated answers against LFRQA's answers using large language models (LLMs) as evaluators. Extensive experiments show that RAG-QA ARENA's model-based judgments correlate strongly with human judgments of answer quality. However, only 41.3% of answers from the most competitive LLMs are preferred over LFRQA's human-written answers, underscoring LFRQA's value as a challenging benchmark for cross-domain RAG-QA.
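Below is a minimal sketch of the kind of pairwise LLM-as-judge comparison this framework performs. The judge prompt, the `llm` callable, and the record field names are assumptions for illustration only; the paper's actual evaluation prompts and judge models differ.

```python
import random


def judge_pair(llm, query: str, model_answer: str, reference_answer: str) -> str:
    """Ask an LLM judge which of two answers better addresses the query.

    `llm` is any callable mapping a prompt string to a completion string
    (hypothetical interface). Answer order is shuffled to reduce position bias.
    Returns "model" or "reference".
    """
    answers = [("model", model_answer), ("reference", reference_answer)]
    random.shuffle(answers)
    prompt = (
        f"Question: {query}\n\n"
        f"Answer A: {answers[0][1]}\n\n"
        f"Answer B: {answers[1][1]}\n\n"
        "Which answer is more helpful, complete, and faithful to the question? "
        "Reply with 'A' or 'B'."
    )
    verdict = llm(prompt).strip().upper()
    return answers[0][0] if verdict.startswith("A") else answers[1][0]


def win_rate(llm, examples) -> float:
    """Fraction of examples where the model answer is preferred over the LFRQA answer."""
    wins = sum(
        judge_pair(llm, ex["query"], ex["model_answer"], ex["lfrqa_answer"]) == "model"
        for ex in examples
    )
    return wins / len(examples)
```

The win rate computed this way is the kind of statistic behind the reported 41.3% figure: the share of head-to-head comparisons in which the LLM judge prefers the model's answer over the LFRQA reference.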
The paper also discusses the limitations of existing datasets and the advantages of LFRQA, including its comprehensive and coherent long-form answers. It provides a detailed description of the data creation process, including annotation instructions and quality control mechanisms. The evaluation framework is designed to be efficient and accurate, and the authors demonstrate its effectiveness through various experiments and comparisons with human judgments.
Finally, the paper concludes by emphasizing the value of LFRQA and RAG-QA ARENA for future RAG-QA research, noting that the evaluation framework can be extended to study the impact of different retrievers or of joint retriever and LLM training.