29 Mar 2024 | Jinhyuk Lee, Zhuyun Dai, Xiaoqi Ren, Blair Chen, Daniel Cer, Jeremy R. Cole, Kai Hui, Michael Boratko, Rajiv Kapadia, Wen Ding, Yi Luan, Sai Meher Karthik Duddu, Gustavo Hernandez Abrego, Weiqiang Shi, Nithi Gupta, Aditya Kusupati, Prateek Jain, Siddhartha Reddy Jonnalagadda, Ming-Wei Chang and Iftekhar Naim
Gecko is a compact, versatile text embedding model that distills knowledge from large language models (LLMs) to improve retrieval performance. Training proceeds in two steps: first, an LLM generates a diverse set of synthetic query-passage pairs; second, the data quality is refined by retrieving candidate passages for each generated query and using the same LLM to relabel the positive and hard negative passages. On the Massive Text Embedding Benchmark (MTEB), this approach lets Gecko compete with models built on 7x larger backbones and with 5x higher-dimensional embeddings. The paper also shows that LLM-based relabeling and the diversity of the synthetic dataset are both important to these results.
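As a rough illustration of that two-step recipe (the paper calls the resulting dataset FRet), here is a minimal Python sketch. The callables `generate_query`, `score_relevance`, and `embed` are hypothetical stand-ins for the LLM prompting and the pre-trained embedder, not names from the paper's code.

```python
# Minimal sketch of Gecko's two-step distillation pipeline.
# `generate_query`, `score_relevance`, and `embed` are hypothetical
# stand-ins (assumptions, not the paper's API) for the LLM prompts
# and the pre-trained embedding model the paper describes.
import math
import random
from typing import Callable, List, Tuple

QueryGen = Callable[[str, str], str]     # (task description, passage) -> query
Scorer = Callable[[str, str], float]     # (query, passage) -> LLM relevance score
Embedder = Callable[[str], List[float]]

def cosine(u: List[float], v: List[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mine_triples(
    corpus: List[str],
    tasks: List[str],
    generate_query: QueryGen,
    score_relevance: Scorer,
    embed: Embedder,
    top_n: int = 5,
) -> List[Tuple[str, str, str]]:
    """Return (query, positive, hard_negative) training triples."""
    corpus_vecs = [embed(p) for p in corpus]
    triples = []
    for seed_passage in corpus:
        # Step 1: prompt the LLM with a sampled task description and a
        # seed passage to generate a synthetic query.
        task = random.choice(tasks)
        query = generate_query(task, seed_passage)

        # Step 2a: retrieve the top-N candidate passages for the query
        # using the current embedder.
        q_vec = embed(query)
        candidates = sorted(
            range(len(corpus)),
            key=lambda i: cosine(q_vec, corpus_vecs[i]),
            reverse=True,
        )[:top_n]

        # Step 2b: relabel with the same LLM. The best-scoring candidate
        # becomes the positive (it may differ from the seed passage);
        # the worst-scoring retrieved candidate serves as the hard negative.
        relabeled = sorted(
            candidates,
            key=lambda i: score_relevance(query, corpus[i]),
            reverse=True,
        )
        triples.append((query, corpus[relabeled[0]], corpus[relabeled[-1]]))
    return triples
```

The design choice this mirrors is that the LLM, not the original pairing, decides the final labels: if a retrieved neighbor answers the generated query better than the seed passage, it is promoted to the positive, which the paper identifies as a key contributor to Gecko's retrieval quality.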