July 14-18, 2024 | Florin Cuconasu*, Giovanni Trappolini*, Federico Siciliano, Simone Filice, Cesare Campagnano, Yoelle Maarek, Nicola Tonellotto, Fabrizio Silvestri
The Power of Noise: Redefining Retrieval for RAG Systems
Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) by incorporating retrieved documents into the prompt to improve factual accuracy. This paper investigates how different document types affect RAG performance, focusing on the retrieval strategy. It examines three document types: relevant, distracting, and random. Relevant documents are crucial for accurate responses, while distracting documents, i.e., high-scoring retrieved documents that do not actually answer the query, can mislead the model and degrade LLM effectiveness. Surprisingly, adding random documents to the prompt improves LLM accuracy by up to 35%, challenging the common assumption that only relevant documents are beneficial.

The position of the gold document within the prompt also significantly affects effectiveness, with the gold document placed near the query yielding the best results. The authors conclude that there is a trade-off between the number of relevant and random documents: a minimal set of relevant documents, supplemented with random documents up to the context limit, is most effective. These findings suggest that the retrieval process should be deliberately designed to mix relevant and random documents, and they highlight the need for further research into how noise can be exploited to improve RAG systems.
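To make the positioning and padding findings concrete, here is a minimal sketch of how a prompt-assembly step might apply them. All identifiers (build_rag_prompt, count_tokens, the whitespace token counter, and the example context budget) are hypothetical illustrations for this summary, not the authors' implementation or any specific library's API.

```python
import random

def count_tokens(text: str) -> int:
    # Crude whitespace tokenizer, used only to keep this sketch
    # dependency-free; a real system would use the LLM's tokenizer.
    return len(text.split())

def build_rag_prompt(query: str,
                     relevant_docs: list[str],
                     random_pool: list[str],
                     context_budget: int = 2048) -> str:
    """Assemble a RAG prompt following two findings from the paper:
    1. keep a minimal set of relevant documents, placed nearest
       the query (the position that yielded the best results);
    2. fill the remaining context budget with random documents,
       placed farthest from the query.
    """
    used = count_tokens(query)

    # Reserve budget for the minimal relevant set first.
    kept_relevant = []
    for doc in relevant_docs:
        cost = count_tokens(doc)
        if used + cost > context_budget:
            break
        kept_relevant.append(doc)
        used += cost

    # Pad the rest of the budget with randomly sampled documents.
    padding = []
    for doc in random.sample(random_pool, k=len(random_pool)):
        cost = count_tokens(doc)
        if used + cost > context_budget:
            break
        padding.append(doc)
        used += cost

    # Order: random padding first (far from the query), then the
    # relevant documents, then the query itself, so the most useful
    # evidence sits adjacent to the question.
    context = "\n\n".join(padding + kept_relevant)
    return f"{context}\n\nQuestion: {query}\nAnswer:"

if __name__ == "__main__":
    prompt = build_rag_prompt(
        query="In which year was the Eiffel Tower completed?",
        relevant_docs=["The Eiffel Tower was completed in 1889 in Paris."],
        random_pool=[
            "Honey never spoils when stored in sealed containers.",
            "The Pacific Ocean is the largest ocean on Earth.",
        ],
        context_budget=128,
    )
    print(prompt)
```

Placing the random documents farthest from the query and the relevant ones adjacent to it mirrors the paper's positional finding, while the token budget stands in for the context limit up to which random padding was reported to help.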