PipeRAG: Fast Retrieval-Augmented Generation via Algorithm-System Co-design


8 Mar 2024 | Wenqi Jiang, Shuai Zhang, Boran Han, Jie Wang, Bernie Wang, Tim Kraska
PipeRAG is an algorithm-system co-design approach that improves the efficiency of retrieval-augmented generation (RAG). It reduces end-to-end generation latency while improving generation quality by combining three techniques: pipeline parallelism between the inference and retrieval systems, flexible retrieval intervals, and a performance model that balances retrieval quality against retrieval latency. The key idea is to prefetch content from the database so that retrieval for an upcoming interval runs concurrently with token generation, hiding retrieval latency and reducing hardware underutilization. To support this, PipeRAG modifies RETRO's attention mechanism to allow flexible retrieval intervals, and its performance model dynamically adjusts the retrieval search space based on the expected latency of the upcoming token inferences.

In evaluation across multiple datasets, PipeRAG achieves up to a 2.6× speedup in end-to-end generation latency while also improving generation quality, outperforming RETRO on both counts. Its latency approaches that of models without retrieval while delivering significantly lower perplexity. These results demonstrate the value of co-designing RAG algorithms with the underlying inference and retrieval systems, paving the way for adopting algorithm-system co-design in future RAG systems.
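To make the pipelining and performance-model ideas concrete, below is a minimal Python sketch of PipeRAG-style generation. It is an illustration under assumptions, not the paper's implementation: the retrieval call, decoder step, latency constants, and the retrieval interval are all stand-in stubs. The sketch shows (1) prefetching neighbors on a background thread so retrieval overlaps with token generation, and (2) a simple performance model that enlarges the retrieval search space only as far as the upcoming token-generation window can hide its latency.

```python
import threading
import time
from queue import Queue

# Minimal sketch of pipelined retrieval-augmented generation.
# All model and database calls below are illustrative stubs (assumptions).

RETRIEVAL_INTERVAL = 8              # tokens generated between retrievals (assumed)
TOKEN_LATENCY_S = 0.02              # assumed per-token inference latency
SEARCH_LATENCY_PER_PROBE_S = 0.004  # assumed cost per scanned index partition

def pick_search_space(tokens_until_use: int) -> int:
    """Performance model: choose the largest search space whose expected
    retrieval latency still hides behind the upcoming generation window."""
    budget = tokens_until_use * TOKEN_LATENCY_S
    return max(1, int(budget // SEARCH_LATENCY_PER_PROBE_S))

def vector_search(query: str, nprobe: int) -> str:
    """Stub retrieval: pretend to probe `nprobe` partitions of a vector index."""
    time.sleep(nprobe * SEARCH_LATENCY_PER_PROBE_S)
    return f"<neighbors for '...{query[-20:]}' @ nprobe={nprobe}>"

def generate_one_token(context: list, neighbors: str) -> str:
    """Stub decoder step: a real RETRO-style model would attend to the
    retrieved neighbors through cross-attention."""
    time.sleep(TOKEN_LATENCY_S)
    return f"tok{len(context)}"

def prefetch(query: str, nprobe: int, out: Queue) -> None:
    """Runs on the retrieval system while inference continues, so the two
    stages overlap (pipeline parallelism)."""
    out.put(vector_search(query, nprobe))

def piperag_generate(prompt: str, max_new_tokens: int) -> list:
    tokens, neighbors, pending = [prompt], None, Queue()
    nprobe = pick_search_space(RETRIEVAL_INTERVAL)
    threading.Thread(target=prefetch, args=(prompt, nprobe, pending)).start()

    for step in range(max_new_tokens):
        # At each interval, consume the prefetched neighbors and immediately
        # launch the next prefetch for the following interval.
        if step % RETRIEVAL_INTERVAL == 0 and not pending.empty():
            neighbors = pending.get()
            nprobe = pick_search_space(RETRIEVAL_INTERVAL)
            threading.Thread(target=prefetch,
                             args=(" ".join(tokens), nprobe, pending)).start()
        tokens.append(generate_one_token(tokens, neighbors))
    return tokens

print(piperag_generate("retrieval-augmented generation", 24))
```

In this sketch the prefetch uses a slightly stale query (the context available when the prefetch is launched), which is the price of overlapping the two stages; the flexible retrieval intervals and the latency-aware search-space choice are what let the approach recover quality while keeping retrieval off the critical path.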