NV-Retriever: Improving text embedding models with effective hard-negative mining

22 Jul 2024 | Gabriel de Souza P. Moreira*, Radek Osmulski*, Mengyao Xu*, Ronay Ak*, Benedikt Schifferer*, Even Oldridge*
This paper introduces NV-Retriever-v1, a text embedding model that achieves state-of-the-art performance on text retrieval tasks. The model scores 60.9 on the MTEB Retrieval (BEIR) benchmark, outperforming previous methods by 0.65 points, and placed first on the MTEB Retrieval leaderboard when it was published on July 7, 2024.

The key contribution is a family of positive-aware hard-negative mining methods that use the positive passage's relevance score to remove likely false negatives from the mined candidates. A comprehensive ablation study compares these methods with other hard-negative mining approaches, across different teacher and base models, and shows that positive-aware methods significantly improve retrieval accuracy. The results also indicate that more powerful teacher models yield more effective hard negatives, leading to better fine-tuning performance.

NV-Retriever-v1 is built on a Mistral 7B base model with bi-directional attention and is trained in two stages: the first stage uses supervised retrieval data with in-batch negatives, while the second blends retrieval data with datasets from other tasks.

The paper also discusses the importance of hard-negative mining in text retrieval and surveys techniques for selecting and refining hard negatives, including mining with different teacher models, ensembling hard negatives mined by several models, and sampling from a range of top-k candidates. Among these, the positive-aware methods TopK-MarginPos and TopK-PercPos are best at removing false negatives and improving retrieval accuracy; TopK-PercPos, which caps negative scores at 95% of the positive score, is particularly effective (both rules are sketched below).

NV-Retriever-v1 is released as an open-source text retrieval model with state-of-the-art retrieval accuracy. The paper describes its architecture, training methods, and mining techniques in detail, reports the ablation study and comparisons with other models, and concludes that positive-aware hard-negative mining is crucial for high performance in text retrieval.
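The two positive-aware filtering rules can be stated concretely. Below is a minimal Python sketch of the candidate-filtering step, assuming each training query already has a teacher relevance score for its positive passage and for its top-k retrieved candidates. The function name, the margin value, and the default of four negatives per query are illustrative assumptions; only the 95% threshold for TopK-PercPos comes from the summary above.

```python
import numpy as np

def select_hard_negatives(pos_score, candidate_scores,
                          method="topk-percpos",
                          margin=0.05, percentage=0.95, n_negatives=4):
    """Filter top-k candidates mined by a teacher model, dropping likely
    false negatives whose score is too close to the positive's score.

    candidate_scores is assumed sorted in descending order (retrieval
    order), so the hardest surviving negatives come first.
    """
    scores = np.asarray(candidate_scores)
    if method == "topk-marginpos":
        # TopK-MarginPos: a negative must score at least `margin` below
        # the positive. The margin value used here is illustrative.
        max_neg_score = pos_score - margin
    elif method == "topk-percpos":
        # TopK-PercPos: a negative must score below a percentage of the
        # positive's score (95% in the paper).
        max_neg_score = pos_score * percentage
    else:
        raise ValueError(f"unknown mining method: {method}")
    keep = np.flatnonzero(scores <= max_neg_score)
    return keep[:n_negatives]


# Example: teacher scores for one query's top-6 candidates.
pos = 0.82
cands = [0.85, 0.80, 0.74, 0.70, 0.66, 0.51]
print(select_hard_negatives(pos, cands))  # [2 3 4 5]; 0.85 and 0.80 exceed 0.95 * 0.82
```

In the example, the two highest-scoring candidates are discarded as likely false negatives because they score above 95% of the positive's score, which is exactly the failure mode naive top-k mining suffers from.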
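Stage-one fine-tuning combines the mined hard negatives with in-batch negatives under a contrastive objective. The summary does not spell out the loss, so the following is a hedged PyTorch sketch of a standard InfoNCE-style formulation; the temperature value and the exact way the two kinds of negatives are combined are assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q_emb, pos_emb, hard_neg_emb, temperature=0.05):
    """InfoNCE-style loss mixing in-batch negatives with mined hard negatives.

    q_emb:        (B, D) query embeddings
    pos_emb:      (B, D) positive passage embeddings
    hard_neg_emb: (B, N, D) mined hard-negative embeddings per query
    """
    q = F.normalize(q_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    n = F.normalize(hard_neg_emb, dim=-1)

    # In-batch scores: every other query's positive acts as a negative;
    # the diagonal holds each query's own positive.
    inbatch = q @ p.T                        # (B, B)
    # Scores against each query's own mined hard negatives.
    hard = torch.einsum("bd,bnd->bn", q, n)  # (B, N)

    logits = torch.cat([inbatch, hard], dim=1) / temperature
    labels = torch.arange(q.size(0), device=q.device)  # positive = diagonal
    return F.cross_entropy(logits, labels)
```

Each query's logits thus cover the whole batch of positives plus its own mined hard negatives, and the cross-entropy target is the query's own positive on the diagonal; better mining directly sharpens this loss by ensuring the extra negatives are hard but not false.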