28 Jun 2024 | Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, Zheng Liu
M3-Embedding is a versatile text embedding model that supports multi-linguality, multi-functionality, and multi-granularity. It handles more than 100 languages, supports dense, sparse, and multi-vector retrieval within a single model, and processes inputs of varying granularity, from short sentences to long documents of up to 8,192 tokens.

The model is trained with a self-knowledge distillation approach, in which the relevance scores produced by the different retrieval functions are integrated into a teacher signal that improves training quality. An optimized batching strategy enables large batch sizes and high training throughput, which strengthens the discriminativeness of the embeddings. Training relies on extensive data curation, combining unsupervised data from multilingual corpora, supervised data, and synthesized data, and proceeds through multi-stage optimization.

M3-Embedding achieves state-of-the-art results on multilingual, cross-lingual, and long-document retrieval benchmarks, and combining its different retrieval methods yields further gains across scenarios. By supporting multiple languages, retrieval functions, and input granularities, it addresses the limitations of existing text embeddings and offers a comprehensive solution for text embedding tasks. The model and its training resources are publicly available.
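As a minimal, hypothetical sketch (not the paper's actual implementation), the three retrieval modes and their integration into a single teacher score for self-knowledge distillation might look like the following. The function names and the combination weights are illustrative assumptions:

```python
import numpy as np

def dense_score(q_cls, p_cls):
    # Dense retrieval: similarity of normalized sentence-level embeddings.
    q = q_cls / np.linalg.norm(q_cls)
    p = p_cls / np.linalg.norm(p_cls)
    return float(q @ p)

def lexical_score(q_weights, p_weights):
    # Sparse (lexical) retrieval: sum of products of learned term weights
    # over the terms shared by query and passage.
    shared = set(q_weights) & set(p_weights)
    return float(sum(q_weights[t] * p_weights[t] for t in shared))

def multi_vector_score(q_vecs, p_vecs):
    # Multi-vector (late-interaction) retrieval: each query token vector
    # takes its maximum similarity over passage token vectors; average these.
    q = q_vecs / np.linalg.norm(q_vecs, axis=1, keepdims=True)
    p = p_vecs / np.linalg.norm(p_vecs, axis=1, keepdims=True)
    sim = q @ p.T
    return float(sim.max(axis=1).mean())

def integrated_score(s_dense, s_lex, s_mul, w=(1.0, 0.3, 1.0)):
    # Weighted sum of the three relevance scores, usable as the teacher
    # signal in self-knowledge distillation. Weights here are illustrative,
    # not the paper's values.
    return w[0] * s_dense + w[1] * s_lex + w[2] * s_mul
```

In this sketch, each retrieval function scores the same query-passage pair independently, and the combined score lets the stronger signal guide the weaker ones during training.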
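The efficient-batching idea, grouping similarly sized inputs so that large batches waste little computation on padding, can be illustrated with a simple length-grouped sketch. This shows the general technique only, not the paper's exact batching procedure:

```python
def length_grouped_batches(texts, batch_size):
    """Group texts into batches of similar length to minimize padding.

    Sorting by length means each mini-batch contains inputs of comparable
    size, so padding to the batch maximum adds little overhead. This is a
    generic sketch of the idea, not M3-Embedding's exact implementation.
    """
    order = sorted(range(len(texts)), key=lambda i: len(texts[i]))
    return [
        [texts[i] for i in order[k:k + batch_size]]
        for k in range(0, len(order), batch_size)
    ]
```

For example, `length_grouped_batches(["aaaa", "a", "aa", "aaa"], 2)` groups the two short and the two long strings together, so neither batch needs padding beyond one character.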