8 Mar 2024 | Wenqi Jiang, Shuai Zhang, Boran Han, Jie Wang, Bernie Wang, Tim Kraska
**Abstract:**
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external token databases. However, retrievals from large databases can significantly increase generation time, especially when retrievals are periodically performed to align with the latest context. This paper introduces PipeRAG, a novel algorithm-system co-design approach to reduce generation latency and improve quality. PipeRAG integrates pipeline parallelism to enable concurrent retrieval and generation processes, flexible retrieval intervals to maximize pipeline efficiency, and a performance model to automatically balance retrieval quality and latency based on generation states and hardware capabilities. Evaluations show that PipeRAG achieves up to 2.6× speedup in end-to-end generation latency while improving generation quality, demonstrating the effectiveness of co-designing algorithms with underlying systems.
**Introduction:**
RAG enhances LLMs by conditioning generation on contextually relevant content retrieved from external databases. Periodic retrievals are essential to track context shifts as generation proceeds, but they slow down the generation process. PipeRAG addresses this by co-designing a system-aware RAG algorithm and an algorithm-aware retrieval system. PipeRAG is built on three performance-centric observations: hardware underutilization caused by the dependency between retrieval and inference, inference time that grows with sequence length, and the trade-off between retrieval quality and latency. It employs pipeline parallelism to overlap retrieval and inference, flexible retrieval intervals to maximize pipeline efficiency, and a performance model to dynamically adjust the search space according to latency expectations.
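To make the last point concrete, here is a minimal sketch (in Python, not from the paper) of how such a performance model could pick a search space. The linear latency model, the `nprobe`-style parameter, and all the per-probe and per-token costs are hypothetical placeholders; the point is only the selection logic: choose the largest search space whose predicted retrieval latency still fits inside the time the next interval of token generation is expected to take.

```python
def predicted_retrieval_latency(nprobe: int, per_probe_ms: float = 5.0) -> float:
    """Hypothetical linear model: search cost grows with the number of probes."""
    return nprobe * per_probe_ms

def predicted_generation_latency(seq_len: int, interval: int,
                                 base_ms: float = 10.0,
                                 per_token_ms: float = 0.2) -> float:
    """Hypothetical model: per-token time grows with the current sequence length."""
    return sum(base_ms + per_token_ms * (seq_len + i) for i in range(interval))

def choose_search_space(seq_len: int, interval: int,
                        candidates=(1, 2, 4, 8, 16, 32, 64, 128)) -> int:
    """Pick the largest search space whose retrieval can hide behind generation."""
    budget = predicted_generation_latency(seq_len, interval)
    feasible = [p for p in candidates if predicted_retrieval_latency(p) <= budget]
    return max(feasible) if feasible else min(candidates)

if __name__ == "__main__":
    # Longer sequences take longer to decode, so retrieval gets a larger time
    # budget and the model can afford a higher-quality (larger) search space.
    for seq_len in (64, 512, 2048):
        print(seq_len, choose_search_space(seq_len, interval=16))
```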
**Background and Motivation:**
RAG improves LLMs by periodically retrieving content from large databases. Periodic retrievals ensure the retrieved content remains relevant to the evolving generation context. RETRO is a representative model that integrates a retrieval system with an inference system. However, frequent retrievals can significantly slow down the generation process. PipeRAG aims to enhance RAG efficiency by optimizing the performance-quality Pareto frontier through algorithm-system co-design.
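The following sketch (hypothetical Python stubs, not RETRO's actual implementation) illustrates the periodic-retrieval loop described above: every `INTERVAL` tokens, generation stops, the latest context is used to query the database, and decoding resumes only once the neighbors arrive. That stall is exactly the inefficiency PipeRAG targets.

```python
import time

INTERVAL = 16  # retrieve once every 16 generated tokens

def retrieve(query_tokens):
    """Hypothetical stub: search the external token database."""
    time.sleep(0.05)  # stands in for vector-search latency
    return [f"neighbor-of-{query_tokens[-1]}"]

def generate_token(tokens, neighbors):
    """Hypothetical stub: one decoding step conditioned on retrieved neighbors."""
    time.sleep(0.01)  # stands in for one transformer forward pass
    return f"tok{len(tokens)}"

def generate_sequential(prompt, num_tokens):
    tokens, neighbors = list(prompt), []
    for step in range(num_tokens):
        if step % INTERVAL == 0:
            # Generation stalls here: retrieval and inference never overlap.
            neighbors = retrieve(tokens[-INTERVAL:])
        tokens.append(generate_token(tokens, neighbors))
    return tokens

if __name__ == "__main__":
    start = time.time()
    generate_sequential(["<prompt>"], 64)
    print(f"sequential generation took {time.time() - start:.2f}s")
```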
**Our Approach:**
PipeRAG addresses hardware inefficiencies by employing pipeline parallelism and flexible retrieval intervals. It modifies RETRO's attention mechanism to support flexible retrieval intervals and uses a performance-model-driven retrieval system to dynamically balance search quality and latency. Evaluations show that PipeRAG achieves significant improvements in both generation quality and efficiency compared to RETRO.
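Below is a minimal sketch of the pipeline-parallelism idea under simplifying assumptions; the `retrieve` and `generate_token` stubs are hypothetical stand-ins for the real retrieval index and LLM, and this is not the paper's implementation. Retrieval for the next interval is launched on a background thread using the content generated so far, so the database search overlaps with decoding. The neighbors consumed at each interval are therefore based on a query that is stale by up to one interval, which is the approximation PipeRAG's modified attention is designed to tolerate.

```python
import time
from concurrent.futures import ThreadPoolExecutor

INTERVAL = 16  # same retrieval interval as the sequential baseline

def retrieve(query_tokens):
    """Hypothetical stub for the vector-search service."""
    time.sleep(0.05)
    return [f"neighbor-of-{query_tokens[-1]}"]

def generate_token(tokens, neighbors):
    """Hypothetical stub for one decoding step of the LLM."""
    time.sleep(0.01)
    return f"tok{len(tokens)}"

def generate_pipelined(prompt, num_tokens):
    tokens, neighbors = list(prompt), []
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Prefetch neighbors for the first interval from the prompt itself.
        pending = pool.submit(retrieve, list(tokens))
        for step in range(num_tokens):
            if step % INTERVAL == 0:
                neighbors = pending.result()  # usually already finished
                # Prefetch for the *next* interval using the tokens generated
                # so far; the query is stale by up to one interval.
                pending = pool.submit(retrieve, list(tokens))
            tokens.append(generate_token(tokens, neighbors))
    return tokens

if __name__ == "__main__":
    start = time.time()
    generate_pipelined(["<prompt>"], 64)
    print(f"pipelined generation took {time.time() - start:.2f}s")
```

Comparing the two sketches, the pipelined loop hides the simulated retrieval latency behind decoding, so its end-to-end time is dominated by the per-token cost alone.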
**Evaluation:**
PipeRAG is evaluated on multiple datasets against large retrieval databases, showing gains in both generation efficiency and quality. It outperforms RETRO in latency and perplexity, achieving up to a 2.6× speedup without compromising quality. Ablation studies confirm the contributions of pipeline parallelism and flexible retrieval intervals.
**Conclusion:**
PipeRAG improves RAG efficiency by adopting pipeline parallelism, flexible retrieval intervals, and dynamic adjustment of retrieval quality. It achieves up to 2.6× speedup over RETRO while maintaining or improving generation quality, establishing a solid foundation for future RAG systems.