29 Mar 2024 | Jinhyuk Lee, Zhuyun Dai, Xiaoqi Ren, Blair Chen, Daniel Cer, Jeremy R. Cole, Kai Hui, Michael Boratko, Rajiv Kapadia, Wen Ding, Yi Luan, Sai Meher Karthik Duddu, Gustavo Hernandez Abrego, Weiqiang Shi, Nithi Gupta, Aditya Kusupati, Prateek Jain, Siddhartha Reddy Jonnalagadda, Ming-Wei Chang and Iftekhar Naim
Gecko is a compact, versatile text embedding model that distills knowledge from large language models (LLMs) to improve retrieval performance. Training proceeds in two steps: first, an LLM generates a diverse set of synthetic query-passage pairs; second, the data quality is refined by retrieving candidate passages for each generated query and using the same LLM to relabel the positive and hard negative passages. On the Massive Text Embedding Benchmark (MTEB), this approach lets Gecko compete with models built on 7x larger backbones and with 5x higher-dimensional embeddings. The paper also shows that LLM-based relabeling and the diversity of the synthetic dataset are both important to these results.
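As a rough illustration of that two-step recipe (the paper calls the resulting dataset FRet), here is a minimal Python sketch. The callables `generate_query`, `score_relevance`, and `embed` are hypothetical stand-ins for the LLM prompting and the pre-trained embedder, not names from the paper's code.

```python
# Minimal sketch of Gecko's two-step distillation pipeline.
# `generate_query`, `score_relevance`, and `embed` are hypothetical
# stand-ins (assumptions, not the paper's API) for the LLM prompts
# and the pre-trained embedding model the paper describes.
import math
import random
from typing import Callable, List, Tuple

QueryGen = Callable[[str, str], str]     # (task description, passage) -> query
Scorer = Callable[[str, str], float]     # (query, passage) -> LLM relevance score
Embedder = Callable[[str], List[float]]

def cosine(u: List[float], v: List[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mine_triples(
    corpus: List[str],
    tasks: List[str],
    generate_query: QueryGen,
    score_relevance: Scorer,
    embed: Embedder,
    top_n: int = 5,
) -> List[Tuple[str, str, str]]:
    """Return (query, positive, hard_negative) training triples."""
    corpus_vecs = [embed(p) for p in corpus]
    triples = []
    for seed_passage in corpus:
        # Step 1: prompt the LLM with a sampled task description and a
        # seed passage to generate a synthetic query.
        task = random.choice(tasks)
        query = generate_query(task, seed_passage)

        # Step 2a: retrieve the top-N candidate passages for the query
        # using the current embedder.
        q_vec = embed(query)
        candidates = sorted(
            range(len(corpus)),
            key=lambda i: cosine(q_vec, corpus_vecs[i]),
            reverse=True,
        )[:top_n]

        # Step 2b: relabel with the same LLM. The best-scoring candidate
        # becomes the positive (it may differ from the seed passage);
        # the worst-scoring retrieved candidate serves as the hard negative.
        relabeled = sorted(
            candidates,
            key=lambda i: score_relevance(query, corpus[i]),
            reverse=True,
        )
        triples.append((query, corpus[relabeled[0]], corpus[relabeled[-1]]))
    return triples
```

The design choice this mirrors is that the LLM, not the original pairing, decides the final labels: if a retrieved neighbor answers the generated query better than the seed passage, it is promoted to the positive, which the paper identifies as a key contributor to Gecko's retrieval quality.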