22 May 2024 | Xin Cheng, Xun Wang, Xingxing Zhang, Tao Ge, Si-Qing Chen, Furu Wei, Huishuai Zhang, Dongyan Zhao
xRAG is a context compression method for retrieval-augmented generation (RAG). It reinterprets document embeddings from dense retrieval as features from a distinct "retrieval modality" and projects them directly into the language model's representation space, enabling extreme compression: each retrieved document is represented by a single token. The only trainable component is the modality bridge; the retriever and the language model remain frozen, so offline-constructed document embeddings can be reused and the plug-and-play nature of RAG is preserved.

xRAG achieves an average improvement of over 10% across six knowledge-intensive tasks, outperforming previous compression methods and matching the performance of uncompressed models on several datasets. On the efficiency side, it delivers a 1.64x speedup in CUDA time and a 3.53x reduction in overall GFLOPs compared with standard RAG, with minimal computational and memory overhead. The method is robust to noisy or irrelevant retrieved context, and it performs strongly on tasks requiring document understanding, though it lags on multi-hop reasoning. The authors attribute its resilience to misleading information to the bridge's unbiased stance toward the LLM's internal knowledge representation. The framework is versatile and adapts to various language model backbones.
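The core mechanism above can be sketched in a few lines. This is a hypothetical illustration, not the authors' code: the function names, the two-layer MLP shape of the bridge, and the toy dimensions are all assumptions. The idea is that a frozen retriever produces one document embedding, a small trainable projector maps it into the LLM's token-embedding space, and the result is prepended to the prompt as a single extra token.

```python
# Hypothetical sketch of xRAG's modality bridge (names and MLP shape assumed).
# Only the bridge would be trained; the retriever and LLM stay frozen.

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def modality_bridge(doc_embedding, W1, W2):
    """Two-layer MLP projector: retriever dim -> LLM hidden dim (ReLU between)."""
    hidden = [max(0.0, h) for h in matvec(W1, doc_embedding)]
    return matvec(W2, hidden)

def build_inputs(doc_embedding, prompt_token_embeddings, W1, W2):
    """Prepend the single projected 'document token' to the prompt embeddings,
    so retrieval augmentation costs exactly one extra token of context."""
    doc_token = modality_bridge(doc_embedding, W1, W2)
    return [doc_token] + prompt_token_embeddings
```

In a real system the projector's output would be fed to the LLM alongside the prompt's token embeddings (e.g. via an `inputs_embeds`-style interface), so the frozen model treats the compressed document exactly like any other token.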