This paper investigates the optimal practices for implementing retrieval-augmented generation (RAG) to enhance the quality and reliability of content produced by large language models (LLMs). The authors systematically evaluate various methods for each module within the RAG framework and recommend the most effective approach for each module. They introduce a comprehensive evaluation benchmark for RAG systems and conduct extensive experiments to determine the best practices among various alternatives. The key contributions of the study include:
1. **Query Classification**: Classifying whether a query requires retrieval at all improves accuracy and reduces latency.
2. **Retrieval**: The "Hybrid with HyDE" method achieves the highest RAG score but at a computational cost. The "Hybrid" or "Original" methods are recommended for balancing performance and efficiency.
3. **Reranking**: Removing the reranking module significantly degrades performance. MonoT5 is recommended for its efficacy in improving the relevance of retrieved documents.
4. **Repacking**: The Reverse configuration, which places the most relevant documents closest to the query, performs best, achieving a RAG score of 0.560.
5. **Summarization**: Recomp demonstrates superior performance, although removing the summarization module can achieve comparable results with lower latency.
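The modular pipeline evaluated above can be sketched end to end. The following is a minimal illustration, not the paper's implementation: each stage is a simplified stand-in (a heuristic classifier, term-overlap retrieval, a toy reranker), whereas the paper's recommended components are HyDE-augmented hybrid retrieval, monoT5, and Recomp. Only the module boundaries and the "Reverse" repacking behavior are taken from the text.

```python
# Sketch of the modular RAG pipeline: classify -> retrieve -> rerank ->
# repack -> prompt. All component bodies are simplified stand-ins
# (assumptions), not the actual methods evaluated in the paper.

def classify_query(query: str) -> bool:
    """Query classification: decide whether retrieval is needed.
    Toy heuristic: treat only question-like inputs as needing retrieval."""
    return query.strip().endswith("?")

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Retrieval stand-in: rank documents by term overlap with the query
    (a real system would use hybrid sparse+dense retrieval, e.g. with HyDE)."""
    terms = set(query.lower().split())
    ranked = sorted(corpus, key=lambda d: -len(terms & set(d.lower().split())))
    return ranked[:k]

def rerank(query: str, docs: list[str]) -> list[str]:
    """Reranking stand-in: order candidates by exact term overlap
    (the paper recommends a cross-encoder such as monoT5 here)."""
    terms = set(query.lower().split())
    return sorted(docs, key=lambda d: -len(terms & set(d.lower().split())))

def repack_reverse(docs: list[str]) -> list[str]:
    """'Reverse' repacking: put the most relevant document LAST,
    closest to the question in the final prompt."""
    return list(reversed(docs))

def build_prompt(query: str, docs: list[str]) -> str:
    context = "\n".join(docs)
    return f"Context:\n{context}\n\nQuestion: {query}"

def rag_pipeline(query: str, corpus: list[str]) -> str:
    if not classify_query(query):
        return f"Question: {query}"  # classifier says: skip retrieval
    docs = repack_reverse(rerank(query, retrieve(query, corpus)))
    return build_prompt(query, docs)
```

The summarization stage is omitted here for brevity; in the recommended setup, Recomp would compress the repacked context before prompt construction.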
The authors also propose two distinct practices for implementing RAG systems:
- **Best Performance Practice**: Incorporates query classification, uses the "Hybrid with HyDE" method for retrieval, employs monoT5 for reranking, opts for Reverse repacking, and leverages Recomp for summarization.
- **Balanced Efficiency Practice**: Incorporates query classification, implements the Hybrid method for retrieval, uses TILDEv2 for reranking, opts for Reverse repacking, and employs Recomp for summarization.
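The two recommended practices can be written down as plain configurations; the module and method names below are taken verbatim from the summary, while the dict schema itself is only illustrative:

```python
# The paper's two recommended RAG configurations as config dicts.
# Method names come from the text; the schema is an assumption.

BEST_PERFORMANCE = {
    "query_classification": True,
    "retrieval": "Hybrid with HyDE",
    "reranking": "monoT5",
    "repacking": "Reverse",
    "summarization": "Recomp",
}

BALANCED_EFFICIENCY = {
    "query_classification": True,
    "retrieval": "Hybrid",
    "reranking": "TILDEv2",
    "repacking": "Reverse",
    "summarization": "Recomp",
}

def diff(a: dict, b: dict) -> dict:
    """Return the modules on which two configurations disagree."""
    return {k: (a[k], b[k]) for k in a if a[k] != b[k]}
```

Comparing the two shows they differ only in the retrieval and reranking modules; the efficiency variant trades HyDE and monoT5 for cheaper alternatives.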
Additionally, the study extends RAG to multimodal applications, incorporating text-to-image and image-to-text retrieval capabilities, which offer advantages in groundedness, efficiency, and maintainability. The paper concludes with a discussion of limitations and future directions, emphasizing the need to explore joint training of retrievers and generators and to extend the framework to other modalities such as speech and video.
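The multimodal extension's groundedness advantage comes from preferring retrieval over generation: before producing a new image, the system checks whether a stored, verified image already matches the request. The sketch below illustrates that decision with a toy bag-of-words similarity; the `retrieve_or_generate` function, the caption cache, and the threshold are all illustrative assumptions, not the paper's implementation (which would use a learned multimodal embedding such as CLIP's).

```python
# Hedged sketch of "retrieval as generation" for text-to-image: reuse a
# cached image whose caption is similar enough to the request, else fall
# back to the image generator. The embedding is a toy bag-of-words
# stand-in, NOT a real multimodal encoder.

import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy text embedding: a bag-of-words term-count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_or_generate(request: str, cache: dict[str, str],
                         threshold: float = 0.5) -> str:
    """cache maps caption -> image path. Reuse a cached image when some
    caption is close enough to the request (grounded, cheap); otherwise
    signal a fallback to the image generator."""
    query = embed(request)
    best = max(cache, key=lambda cap: cosine(query, embed(cap)), default=None)
    if best is not None and cosine(query, embed(best)) >= threshold:
        return cache[best]            # grounded: reuse a verified image
    return "<generate new image>"     # fallback to the generator
```

The image-to-text direction is symmetric: match an input image against stored images and return the cached caption, avoiding a fresh captioning pass.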