22 Jul 2024 | Gabriel de Souza P. Moreira, Radek Osmulski, Mengyao Xu, Ronay Ak, Benedikt Schifferer, Even Oldridge
This paper introduces NV-Retriever-v1, a state-of-the-art text embedding model designed for information retrieval applications such as semantic search and question-answering systems based on Retrieval-Augmented Generation (RAG). The authors highlight the importance of effective hard-negative mining when fine-tuning text embedding models, a step that is often underexplored in the literature. They propose a family of positive-aware mining methods that use the relevance score of the positive passage to better remove false negatives, improving the accuracy of the fine-tuned models.
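To make the idea of positive-aware mining concrete, here is a minimal sketch of one plausible variant: candidates whose teacher score exceeds a fixed fraction of the positive passage's score are treated as likely false negatives and discarded before the top-k hard negatives are chosen. The function name, the 0.95 ratio, and the toy scores are illustrative assumptions, not values taken from the summary above, and the sketch assumes similarity scores are positive.

```python
import numpy as np

def mine_hard_negatives(scores, positive_idx, k=4, perc_pos=0.95):
    """Return indices of up to k hard negatives for a single query.

    scores:       1-D sequence of teacher-model similarity scores, one per candidate.
    positive_idx: index of the labeled positive passage.
    perc_pos:     hypothetical ratio; any candidate scoring above
                  perc_pos * positive_score is dropped as a likely false negative.
    """
    scores = np.asarray(scores, dtype=float)
    max_neg_score = perc_pos * scores[positive_idx]

    # Rank candidates from most to least similar to the query.
    order = np.argsort(-scores)

    # Keep the hardest candidates that stay below the positive-aware threshold.
    picked = [
        int(i) for i in order
        if i != positive_idx and scores[i] <= max_neg_score
    ]
    return picked[:k]

# Toy example: candidate 0 is the positive; candidates 1 and 5 score
# almost as high as the positive, so they are filtered out.
scores = [0.80, 0.79, 0.70, 0.65, 0.30, 0.78]
negatives = mine_hard_negatives(scores, positive_idx=0, k=3)
print(negatives)
```

A plain top-k miner would have selected candidates 1 and 5 here, which are the most likely to be unlabeled positives; tying the cutoff to the positive's score is what removes them.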
The paper includes a comprehensive ablation study comparing different hard-negative mining methods, teacher models, and ensemble techniques. The results show that the proposed positive-aware mining methods significantly improve the performance of text embedding models. NV-Retriever-v1, trained using these methods, achieved an average NDCG@10 score of 60.9 on the MTEB Retrieval benchmark, placing first on the leaderboard at the time of its publication.
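For readers unfamiliar with the metric, the sketch below shows how NDCG@10 can be computed for a single query. It assumes linear gain and that every relevant passage appears somewhere in the ranked list, so the ideal ordering can be derived by sorting the same labels; the function names are illustrative. The raw value lies between 0 and 1, and leaderboards often report it multiplied by 100.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(
        rel / math.log2(rank + 2)  # rank is 0-based, so position 1 divides by log2(2)
        for rank, rel in enumerate(relevances[:k])
    )

def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k for one query, given relevance labels in ranked order.

    Assumes all relevant passages are present in ranked_relevances, so the
    ideal ranking is simply the same labels sorted in descending order.
    """
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

# Relevant passages retrieved at positions 1 and 3.
score = ndcg_at_k([1, 0, 1, 0, 0])
print(round(score, 4))
```

A benchmark-level figure is then the mean of this per-query value over all queries and datasets.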
The key contributions of the paper are:
1. Positive-aware hard-negative mining methods that leverage the positive relevance score to remove false negatives.
2. A detailed ablation study on different hard-negative mining methods and their configurations.
3. The release of NV-Retriever-v1, a state-of-the-art text retrieval model.
The authors recommend that practitioners experiment with different configurations of the proposed methods to find the best setup for their fine-tuning and evaluation tasks. They also encourage future work to disclose its hard-negative mining methodology so that results can be reproduced and replicated.