Understanding FeB4RAG: Evaluating Federated Search in the Context of Retrieval Augmented Generation

FeB4RAG is a dataset designed for evaluating federated search within Retrieval Augmented Generation (RAG) pipelines. Federated search systems aggregate results from multiple sources to improve result quality and better match user intent. Existing federated search datasets, such as those from the TREC FedWeb tracks, do not reflect modern information retrieval challenges. FeB4RAG addresses this gap with 790 information requests tailored to chatbot applications, along with the top results returned by each resource and LLM-derived relevance judgements.

FeB4RAG is built from 16 sub-collections of the BEIR benchmark, each paired with a state-of-the-art retrieval model. The user requests are diverse and were generated through a structured process. The relevance labels are created using LLMs and show high agreement with human annotations. The dataset is used to evaluate federated search methods, including resource selection and result merging, and to demonstrate how a high-quality federated search system affects response generation compared with a naive approach. The results show that naive federated search strategies are far from optimal, which highlights the need for effective federated search methods in RAG pipelines. FeB4RAG thus provides a reliable and expandable framework for developing and evaluating federated search in RAG systems.
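Resource selection and result merging, the two federated search stages named in the summary, can be illustrated with a minimal sketch. It is not the method used in FeB4RAG. The function names, the per-resource relevance estimates, and the min-max score normalization are all assumptions chosen for illustration. Normalization is included because different retrieval models typically produce scores on different scales.

```python
from typing import Dict, List, Tuple

# Hypothetical structure: a ranked list of (doc_id, retrieval score) per resource.
Ranking = List[Tuple[str, float]]


def select_resources(resource_scores: Dict[str, float], k: int) -> List[str]:
    """Resource selection: keep the k resources with the highest estimated relevance.

    `resource_scores` is an assumed per-query relevance estimate for each resource.
    """
    ranked = sorted(resource_scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def min_max_normalize(ranking: Ranking) -> Ranking:
    """Rescale one resource's scores to [0, 1] so they are comparable across resources."""
    if not ranking:
        return []
    scores = [score for _, score in ranking]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores are equal; treat every document as equally strong.
        return [(doc_id, 1.0) for doc_id, _ in ranking]
    return [(doc_id, (score - lo) / (hi - lo)) for doc_id, score in ranking]


def merge_results(results: Dict[str, Ranking], selected: List[str], top_n: int) -> List[str]:
    """Result merging: pool normalized results from the selected resources and re-rank."""
    pooled: List[Tuple[str, float]] = []
    for name in selected:
        for doc_id, score in min_max_normalize(results.get(name, [])):
            pooled.append((f"{name}/{doc_id}", score))
    # sorted() is stable, so ties keep the order in which resources were selected.
    pooled.sort(key=lambda item: item[1], reverse=True)
    return [doc for doc, _ in pooled[:top_n]]


if __name__ == "__main__":
    # Toy data, invented for this example only.
    results = {
        "a": [("d1", 10.0), ("d2", 5.0)],
        "b": [("d3", 0.9), ("d4", 0.1)],
        "c": [("d5", 3.0)],
    }
    resource_scores = {"a": 0.2, "b": 0.9, "c": 0.5}
    selected = select_resources(resource_scores, k=2)
    print(selected)
    print(merge_results(results, selected, top_n=2))
```

A naive baseline in this sketch would pass every resource name as `selected`, which corresponds to querying all resources without any selection step.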

FeB4RAG: Evaluating Federated Search in the Context of Retrieval Augmented Generation

February 2024 | SHUAI WANG, EKATERINA KHRAMTSOVA, SHENGYAO ZHUANG, GUIDO ZUCCON