RAG-QA Arena: Evaluating Domain Robustness for Long-form Retrieval Augmented Question Answering


19 Jul 2024 | Rujun Han, Yuhao Zhang, Peng Qi, Yumo Xu, Jenyuan Wang, Lan Liu, William Yang Wang, Bonan Min, Vittorio Castelli
The paper introduces Long-form RobustQA (LFRQA), a new dataset for evaluating retrieval-augmented generation question answering (RAG-QA) systems. LFRQA consists of 26,000 queries across seven domains, with answers that integrate multiple short extractive answers from different documents into a single, coherent long-form narrative. The dataset addresses the limitations of existing benchmarks such as ROBUSTQA and Natural Questions (NQ), whose short extractive answers are poorly suited to evaluating the long-form responses generated by modern large language models (LLMs). Because LFRQA provides high-quality, human-annotated answers whose style is closer to LLM output, it serves as a better benchmark for assessing RAG-QA systems. To evaluate these systems, the authors propose RAG-QA ARENA, a framework that directly compares model-generated answers against LFRQA's annotated answers using LLMs as evaluators; this LLM-based evaluation correlates strongly with human judgments. Only 41.3% of answers from the most competitive LLMs are preferred over LFRQA's answers, underscoring how challenging the benchmark is.
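To make the evaluation setup concrete, below is a minimal sketch of the kind of pairwise, LLM-as-judge comparison RAG-QA ARENA describes: for each question, an evaluator sees the model's answer and LFRQA's human-written answer and picks the one it prefers, and the win rate is the fraction of questions the model answer wins. The judge here is a stand-in function; the prompt wording, tie handling, and position-swapping details are assumptions for illustration, not the authors' exact implementation.

```python
# Minimal sketch of a RAG-QA Arena style pairwise evaluation loop.
# `judge_fn` stands in for an LLM evaluator (e.g., a chat model prompted to
# pick the better answer); its signature and behavior are illustrative assumptions.
import random
from typing import Callable, Iterable


def pairwise_win_rate(
    examples: Iterable[dict],
    judge_fn: Callable[[str, str, str], str],
    seed: int = 0,
) -> float:
    """Fraction of questions where the model answer is preferred (ties count as 0.5).

    Each example is expected to look like:
      {"question": ..., "model_answer": ..., "reference_answer": ...}
    where "reference_answer" is the human-written LFRQA answer.
    """
    rng = random.Random(seed)
    wins, total = 0.0, 0
    for ex in examples:
        # Randomize presentation order to reduce position bias in the LLM judge.
        model_first = rng.random() < 0.5
        a, b = (
            (ex["model_answer"], ex["reference_answer"])
            if model_first
            else (ex["reference_answer"], ex["model_answer"])
        )
        verdict = judge_fn(ex["question"], a, b)  # expected to return "A", "B", or "tie"
        if verdict == "tie":
            wins += 0.5
        elif (verdict == "A") == model_first:
            wins += 1.0
        total += 1
    return wins / max(total, 1)


# Toy judge that simply prefers the longer answer, just to make the sketch runnable.
def toy_judge(question: str, answer_a: str, answer_b: str) -> str:
    return "A" if len(answer_a) >= len(answer_b) else "B"


if __name__ == "__main__":
    data = [{
        "question": "What is LFRQA?",
        "model_answer": "A long-form RAG-QA benchmark.",
        "reference_answer": "LFRQA is a 26K-question benchmark with coherent long-form answers.",
    }]
    print(f"win rate vs. reference: {pairwise_win_rate(data, toy_judge):.3f}")
```

In practice the toy judge would be replaced by a call to a strong LLM with a comparison prompt, and the reported win rate would be aggregated per domain and per model.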
The paper also discusses the limitations of extractive RAG-QA, such as its inability to represent long-form responses and the risk of unfair evaluation when relying on token-overlap metrics. LFRQA addresses these issues by asking annotators to combine multiple short answers into a coherent long-form answer, which better reflects the capabilities of modern LLMs. The authors conducted extensive experiments with a range of LLMs, including GPT-4, MIXTRAL, and LLAMA, to measure their performance on LFRQA. The results show that GPT-4o performs best in several domains, with MIXTRAL-8X22B-INSTRUCT and GPT-4-TURBO as strong competitors, and they highlight the importance of evaluating RAG-QA systems across a diverse set of models and domains. The paper concludes that LFRQA provides a comprehensive and challenging benchmark for RAG-QA, that RAG-QA ARENA offers a reliable framework for comparing model-generated answers with human-annotated ones, and that further research is needed to improve RAG-QA evaluation as LLMs grow more capable.
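For illustration, a single LFRQA-style record might look like the following. The field names and example content are hypothetical rather than the released data format, but they capture the construction described above: short extractive answers drawn from different documents are merged into one coherent long-form answer that cites its sources.

```python
# Hypothetical shape of one LFRQA-style record (field names are illustrative,
# not the released schema): several short extractive spans from different
# retrieved documents are merged by an annotator into a single long-form answer.
lfrqa_example = {
    "domain": "biomedical",
    "question": "What are common side effects of ibuprofen?",
    "short_answers": [
        {"doc_id": "d1", "span": "stomach upset and heartburn"},
        {"doc_id": "d2", "span": "increased risk of gastrointestinal bleeding with long-term use"},
    ],
    "long_form_answer": (
        "Ibuprofen commonly causes stomach upset and heartburn [d1], and long-term "
        "use can increase the risk of gastrointestinal bleeding [d2]."
    ),
}
```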