8 May 2024 | Luke Merrick, Danmei Xu, Gaurav Nuti, and Daniel Campos
This report introduces the Arctic-embed family of text embedding models, which consists of five models ranging from 22 to 334 million parameters. At the time of their release, these models achieved state-of-the-art retrieval accuracy on the MTEB Retrieval leaderboard, outperforming closed-source models such as Cohere's embed-v3 and OpenAI's text-embedding-3-large. The report details how the training datasets were created and provides ablation studies to explain the models' performance.
The Arctic-embed models are trained in two stages: large-scale contrastive pretraining with in-batch negatives, followed by fine-tuning on pairs augmented with hard negative documents. The pretraining dataset contains roughly 300 million query-document pairs, while the fine-tuning dataset consists of around 1 million pairs. Training also relies on a tunable hard negative mining strategy and synthetic data generation to improve retrieval quality.
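To make the pretraining objective concrete, below is a minimal sketch of an InfoNCE-style contrastive loss with in-batch negatives, where each query's positive document is contrasted against every other document in the batch. The temperature value and the toy random embeddings are placeholder assumptions, not settings taken from the report.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb: torch.Tensor,
                              doc_emb: torch.Tensor,
                              temperature: float = 0.02) -> torch.Tensor:
    """InfoNCE-style loss where every other document in the batch
    serves as a negative for a given query.

    query_emb, doc_emb: [batch_size, dim] tensors where row i of
    doc_emb is the positive document for row i of query_emb.
    temperature: placeholder value, not taken from the report.
    """
    # Cosine similarity via normalized dot products.
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    scores = q @ d.T / temperature  # [batch, batch] similarity matrix
    # Diagonal entries are positives; off-diagonal entries are in-batch negatives.
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)

# Toy usage: random embeddings standing in for encoder outputs.
queries = torch.randn(8, 768)
docs = torch.randn(8, 768)
loss = in_batch_contrastive_loss(queries, docs)
```

Because every other document in the batch acts as a negative, larger batches supply more negatives per query, which is one reason this objective pairs naturally with large-scale pretraining data.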
Key contributions of the report include:
- Open-sourcing the Arctic-embed models under an Apache 2.0 license.
- Demonstrating the importance of data organization and quality in improving retrieval performance.
- Presenting novel methods for query generation and synthetic data creation.
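The tunable hard negative mining mentioned above can be illustrated with a small sketch: negatives are drawn from a candidate ranking produced by an existing retriever, and a tunable score ceiling filters out candidates that score too close to the labeled positive, since those are likely false negatives. The function name, scoring scheme, and default ratio below are hypothetical illustrations, not the authors' exact procedure.

```python
from typing import Dict, List

def mine_hard_negatives(candidates: List[Dict],
                        positive_score: float,
                        max_score_ratio: float = 0.95,
                        num_negatives: int = 10) -> List[str]:
    """Select hard negatives for one query from a scored candidate list.

    candidates: dicts like {"doc_id": ..., "score": ...}, scored by an
    existing retriever and sorted from most to least relevant.
    positive_score: the retriever's score for the labeled positive document.
    max_score_ratio: tunable ceiling (placeholder value); candidates scoring
    above max_score_ratio * positive_score are skipped as likely false negatives.
    """
    ceiling = max_score_ratio * positive_score
    negatives = []
    for cand in candidates:
        if cand["score"] >= ceiling:
            continue  # too similar to the positive -- possibly relevant
        negatives.append(cand["doc_id"])
        if len(negatives) == num_negatives:
            break
    return negatives

# Toy usage with hypothetical retriever scores.
ranked = [{"doc_id": f"d{i}", "score": 1.0 - 0.05 * i} for i in range(40)]
hard_negs = mine_hard_negatives(ranked, positive_score=0.98)
```

Raising max_score_ratio admits negatives that score closer to the positive (harder examples), while lowering it keeps only clearly less-relevant candidates (easier but safer); that ratio is the tunable knob.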
The report also includes detailed experimental results and ablation studies to validate the effectiveness of the training methods and the impact of various hyperparameters. The authors conclude by discussing future work, including improved curriculum learning and robustness to compression techniques.