25 May 2024 | Yun Zhu, Jia-Chen Gu, Caitlin Sikora, Ho Ko, Yinxiao Liu, Chu-Cheng Lin, Lei Shu, Liangchen Luo, Lei Meng, Bang Liu, Jindong Chen
This paper introduces Sparse RAG, a novel approach to accelerating inference in retrieval-augmented generation (RAG) systems. Traditional RAG systems suffer from increased latency because the input length grows linearly with the number of retrieved documents. Sparse RAG addresses this with a sparse selection mechanism that retains only highly relevant contexts during decoding. It folds per-document relevance assessment and response generation into a single decoding pass, using special control tokens to prompt the model to judge each retrieved context and to guide its attention.
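To make the control-token idea concrete, the toy sketch below treats the probability of a hypothetical [Relevant] control token as a per-document relevance score. Everything here is an illustrative assumption: the function name, the prompt-free word-overlap proxy, and the stopword list stand in for an actual forward pass of the RAG-tuned model, and none of it reflects the paper's prompt format or any library API.

```python
# Minimal sketch: score each retrieved document by a stand-in for the
# probability of a [Relevant] control token. Hypothetical, not the paper's code.
import math

def control_token_prob(query: str, document: str) -> float:
    """Stand-in for the probability the model assigns to the [Relevant]
    control token after reading the query and one retrieved document.
    A real system would run the LLM and read the control token's logit."""
    q = set(query.lower().split()) - {"how", "does", "the", "a", "of"}
    d = set(document.lower().split())
    logit = float(len(q & d))                      # toy proxy for the model's logit
    return 1.0 / (1.0 + math.exp(-(logit - 1.0)))  # squash to a probability

if __name__ == "__main__":
    query = "how does sparse rag speed up decoding"
    docs = [
        "sparse rag prunes retrieved contexts before decoding",
        "yesterday's football scores and league standings",
    ]
    for doc in docs:
        print(f"{control_token_prob(query, doc):.2f}  {doc}")
```

Documents whose score clears a confidence threshold would then be the only ones attended to while generating the answer, which is where the speedup comes from.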
Sparse RAG operates by first prefilling the key-value cache with all retrieved documents, then attending only to the caches of the most relevant documents during decoding. This sharply reduces the number of cached tokens the model must attend to at each decoding step, improving inference efficiency. Filtering out distracting contexts also sharpens the model's focus on relevant information and improves generation quality.
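The prefill-then-sparse-decode flow can be illustrated with a toy single-head attention in numpy: every retrieved document is prefilled into its own key-value cache, but a decoding step attends only to the caches of a selected subset. The projection matrices, shapes, and selection indices below are invented for illustration and are not the paper's implementation.

```python
# Toy sketch of prefilling per-document KV caches and decoding against
# only a relevant subset. Single head, random projections, hypothetical shapes.
import numpy as np

D_MODEL = 16
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((D_MODEL, D_MODEL)) for _ in range(3))

def prefill(doc_embeddings: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Compute and cache keys/values for one retrieved document.
    Each document is prefilled independently of the others."""
    return doc_embeddings @ W_k, doc_embeddings @ W_v

def decode_step(x: np.ndarray,
                kv_caches: list[tuple[np.ndarray, np.ndarray]],
                selected: list[int]) -> np.ndarray:
    """One decoding step that attends only to the caches of selected documents,
    so the attention span (and hence per-step cost) shrinks with the selection."""
    K = np.concatenate([kv_caches[i][0] for i in selected], axis=0)
    V = np.concatenate([kv_caches[i][1] for i in selected], axis=0)
    q = x @ W_q
    scores = q @ K.T / np.sqrt(D_MODEL)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

# Prefill all retrieved documents, but decode against only the relevant subset.
docs = [rng.standard_normal((20, D_MODEL)) for _ in range(10)]  # 10 retrieved docs
caches = [prefill(d) for d in docs]
selected = [0, 3, 7]                  # indices judged relevant (hypothetical)
out = decode_step(rng.standard_normal(D_MODEL), caches, selected)
print(out.shape)                      # attends to 3 * 20 cached tokens, not 10 * 20
```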
The paper evaluates Sparse RAG on two datasets: PopQA and QMSum. Results show that Sparse RAG achieves a better balance between generation quality and computational efficiency compared to standard dense-RAG and PCW-RAG approaches. It demonstrates superior performance in both short- and long-form generation tasks, with significantly faster decoding speeds and higher quality outputs.
The approach is also tested across different LLM sizes, demonstrating compatibility with various foundation models. Ablations further examine the impact of confidence thresholds, silver relevance labels, and the number of prefilled documents on performance. The results indicate that Sparse RAG maintains high quality and efficiency across these configurations, making it a versatile solution for RAG systems. The study highlights the effectiveness of sparse context selection in improving the performance of large language models on retrieval-augmented generation tasks.
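As a rough illustration of the confidence-threshold knob, the sketch below keeps only documents whose relevance score clears a threshold; the scores, threshold values, and the minimum-keep fallback are made-up assumptions, not behavior reported in the paper.

```python
# Hypothetical document selection by confidence threshold: raising the threshold
# shrinks the decoded context (faster) at the risk of dropping useful evidence.
def select_documents(scores: list[float], threshold: float, min_keep: int = 1) -> list[int]:
    """Return indices of documents whose relevance confidence clears the threshold;
    if none do, fall back to the top-scoring document(s)."""
    kept = [i for i, s in enumerate(scores) if s >= threshold]
    if not kept:
        kept = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:min_keep]
    return kept

scores = [0.92, 0.15, 0.71, 0.40, 0.88]
for tau in (0.3, 0.5, 0.8):
    print(tau, select_documents(scores, tau))
# 0.3 -> [0, 2, 3, 4]; 0.5 -> [0, 2, 4]; 0.8 -> [0, 4]
```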