28 Jun 2024 | Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, Zheng Liu
The paper introduces M3-Embedding, a versatile text embedding model that supports multi-linguality, multi-functionality, and multi-granularity. It can handle over 100 languages, perform dense, sparse, and multi-vector retrieval, and process inputs ranging from short sentences to long documents of up to 8,192 tokens. The training of M3-Embedding involves several technical contributions, including a novel self-knowledge distillation approach that integrates relevance scores from the different retrieval functionalities, an optimized batching strategy for high training throughput, and comprehensive data curation from diverse sources. Experimental results demonstrate that M3-Embedding outperforms existing methods on multi-lingual, cross-lingual, and long-document retrieval benchmarks, achieving state-of-the-art performance. The model's versatility and effectiveness make it a significant advancement in text embedding technology.
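To make the multi-functionality concrete, below is a minimal sketch of scoring a query-passage pair with all three retrieval modes, assuming the FlagEmbedding package and its BGEM3FlagModel wrapper (the authors' released implementation, published as BAAI/bge-m3); the exact flag and method names follow that package's documented interface but should be treated as assumptions here, not guarantees.

```python
# Sketch: one encode call yields dense, sparse (lexical), and multi-vector
# representations, which are scored in three different ways.
# Assumes: pip install FlagEmbedding; model name "BAAI/bge-m3".
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

queries = ["What is M3-Embedding?"]
passages = [
    "M3-Embedding supports dense, sparse, and multi-vector retrieval "
    "across more than 100 languages and inputs up to 8,192 tokens."
]

# Request all three representations in a single pass.
q = model.encode(queries, return_dense=True, return_sparse=True,
                 return_colbert_vecs=True)
p = model.encode(passages, return_dense=True, return_sparse=True,
                 return_colbert_vecs=True)

# Dense retrieval: inner product of the pooled sentence vectors.
dense_score = q["dense_vecs"] @ p["dense_vecs"].T

# Sparse retrieval: overlap of per-token lexical weights.
sparse_score = model.compute_lexical_matching_score(
    q["lexical_weights"][0], p["lexical_weights"][0])

# Multi-vector retrieval: ColBERT-style late interaction over token vectors.
multivec_score = model.colbert_score(
    q["colbert_vecs"][0], p["colbert_vecs"][0])
```

In the paper's self-knowledge distillation scheme, the scores from these three modes are combined into an integrated relevance score that then serves as a teacher signal for each individual retrieval head, which is what lets one model serve all three functionalities without degrading any of them.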