NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

27 May 2024 | Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
NV-Embed is a generalist embedding model that significantly improves the performance of decoder-only large language models (LLMs) on embedding and retrieval tasks.

On the architecture side, NV-Embed introduces a latent attention layer to obtain pooled embeddings, which consistently improves retrieval and downstream task accuracy compared with mean pooling or using the last <EOS> token embedding. It also removes the causal attention mask during contrastive training, which strengthens representation learning since every token can attend to the full input sequence.

For training, the paper proposes a two-stage contrastive instruction-tuning method. The first stage applies contrastive training with instructions on retrieval datasets, using in-batch negatives and curated hard negatives. The second stage blends non-retrieval datasets (classification, clustering, and semantic textual similarity) into the instruction tuning, which improves non-retrieval task accuracy without hurting, and in fact further improving, retrieval performance.

NV-Embed achieves a record-high score of 69.32 on the Massive Text Embedding Benchmark (MTEB), ranking first across its 56 tasks, and scores 59.36 on the benchmark's 15 retrieval tasks. The model is trained only on publicly available data, without any synthetic data from proprietary LLMs, and outperforms previous leading models such as E5-mistral-7b-instruct, SFR-Embedding, and Voyage-large-2-instruct.
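To make the pooling idea concrete, below is a minimal PyTorch sketch of latent-attention pooling as described in the summary: the LLM's last-layer hidden states (with bidirectional attention) attend over a small trainable latent array, pass through an MLP, and are mean-pooled into a single embedding. The class name, hyperparameters (e.g., num_latents=512, the MLP ratio), and the final L2 normalization are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentAttentionPooling(nn.Module):
    """Sketch of latent-attention pooling (hypothetical, not official NV-Embed code).

    Sequence hidden states act as queries against a trainable latent array
    that serves as both keys and values; the attended output goes through an
    MLP and is mean-pooled over non-padding tokens into one embedding.
    """

    def __init__(self, d_model: int, num_latents: int = 512, mlp_ratio: int = 4):
        super().__init__()
        # Trainable latent array shared across all inputs (keys == values here).
        self.latents = nn.Parameter(torch.randn(num_latents, d_model) * 0.02)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, mlp_ratio * d_model),
            nn.GELU(),
            nn.Linear(mlp_ratio * d_model, d_model),
        )

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, d_model) from the LLM's last layer
        # attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
        scale = hidden_states.size(-1) ** -0.5
        # Cross-attention: each token queries the latent array.
        scores = torch.einsum("bld,rd->blr", hidden_states, self.latents) * scale
        attended = torch.softmax(scores, dim=-1) @ self.latents   # (batch, seq_len, d_model)
        attended = self.mlp(attended)
        # Mean-pool over non-padding positions to produce one embedding per input.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (attended * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-6)
        return F.normalize(pooled, dim=-1)
```

In use, the pooled embeddings of instruction-prefixed queries and passages would feed a standard contrastive loss with in-batch and hard negatives during both training stages; the snippet above only covers the pooling step.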