Gecko: Versatile Text Embeddings Distilled from Large Language Models


29 Mar 2024 | Jinhyuk Lee, Zhuyun Dai, Xiaoqi Ren, Blair Chen, Daniel Cer, Jeremy R. Cole, Kai Hui, Michael Boratko, Rajvi Kapadia, Wen Ding, Yi Luan, Sai Meher Karthik Duddu, Gustavo Hernandez Abrego, Weiqiang Shi, Nithi Gupta, Aditya Kusupati, Prateek Jain, Siddhartha Reddy Jonnalagadda, Ming-Wei Chang and Iftekhar Naim
Gecko is a compact and versatile text embedding model distilled from large language models (LLMs). It achieves strong retrieval performance by distilling knowledge from an LLM into a retriever through a two-step distillation process: first, an LLM generates diverse synthetic paired data (tasks and queries); then data quality is refined by retrieving candidate passages for each generated query and having the same LLM relabel the positive and hard negative passages.

On the Massive Text Embedding Benchmark (MTEB), Gecko with 256 embedding dimensions outperforms all existing entries that use 768-dimensional embeddings, and Gecko with 768 dimensions achieves an average score of 66.31, comparable to much larger models and higher-dimensional embeddings. Gecko-1B with 768-dimensional embeddings achieves the best MTEB performance among models of comparable size and embedding dimension, often matching or exceeding systems built on larger base models or higher-dimensional embeddings, and it is particularly good at balancing retrieval with the benchmark's other embedding tasks.

Gecko starts from a pre-trained transformer language model and adds two training stages: pre-finetuning and fine-tuning. The pre-finetuning stage uses self-supervised tasks over a large text corpus to improve performance on downstream tasks. The fine-tuning stage relies on FRet, a novel dataset built with the two-step LLM distillation process, which identifies both a positive passage and hard negative passages for every generated query.

FRet is constructed by few-shot prompting an LLM over a web corpus: for each sampled passage, the LLM generates a task description and a corresponding query. The concatenated task and query is then embedded with a pretrained embedding model to retrieve the top-N nearest-neighbor passages from the corpus, and the LLM reranks these candidates to mine a positive passage and hard negatives. This relabeling step is crucial for data quality, because the best passage for answering a generated query often differs from the passage the query was generated from.
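To make the mining pipeline concrete, the following Python sketch walks through one FRet-style example under stated assumptions. The callables generate_task_and_query, embed, and score_relevance are hypothetical placeholders for the few-shot prompted LLM, the pretrained embedding model, and the LLM reranker; the paper combines more than one LLM ranking signal, so this single-score version is a simplification rather than the authors' exact procedure.

```python
from typing import Callable, List, Tuple

import numpy as np


def mine_fret_example(
    seed_passage: str,
    corpus_passages: List[str],
    corpus_embeddings: np.ndarray,                              # [num_passages, dim], precomputed
    generate_task_and_query: Callable[[str], Tuple[str, str]],  # few-shot prompted LLM (placeholder)
    embed: Callable[[str], np.ndarray],                         # pretrained embedding model (placeholder)
    score_relevance: Callable[[str, str], float],               # LLM relevance scorer (placeholder)
    top_n: int = 20,
) -> Tuple[str, str, str, str]:
    """Produce one FRet-style training example: (task, query, positive, hard negative)."""
    # Step 1: the LLM reads a sampled web passage and writes a task description plus a query.
    task, query = generate_task_and_query(seed_passage)

    # Step 2: embed the concatenated task and query, then retrieve the top-N corpus
    # neighbors by cosine similarity.
    q_vec = embed(f"{task} {query}")
    sims = corpus_embeddings @ q_vec
    sims = sims / (np.linalg.norm(corpus_embeddings, axis=1) * np.linalg.norm(q_vec) + 1e-9)
    neighbor_ids = np.argsort(-sims)[:top_n]

    # Step 3: the same LLM reranks the retrieved candidates. The top-ranked passage becomes
    # the positive (it may differ from the seed passage), and a low-ranked neighbor is kept
    # as a hard negative.
    ranked = sorted(neighbor_ids, key=lambda i: score_relevance(query, corpus_passages[i]), reverse=True)
    positive = corpus_passages[ranked[0]]
    hard_negative = corpus_passages[ranked[-1]]
    return task, query, positive, hard_negative
```

Taking the lowest-ranked of the retrieved neighbors as the hard negative is only one possible choice; the essential point is that the negative is drawn from passages that already look similar to the query, which is what makes it hard.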
For fine-tuning, Gecko combines FRet with human-annotated academic datasets into a novel training mixture. The model is trained with a standard contrastive loss whose objective is to score each query's LLM-relabeled positive passage above its hard negatives, and it is this mix of LLM-generated and human-annotated data that yields Gecko's strong MTEB results.
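The training objective described above can be sketched as a standard in-batch softmax loss in PyTorch. This is a generic version of that kind of loss rather than the paper's exact recipe: the temperature value, the normalization, and the precise set of negatives (the other in-batch positives plus each query's own hard negative) are assumptions made to keep the example self-contained.

```python
import torch
import torch.nn.functional as F


def in_batch_contrastive_loss(
    query_emb: torch.Tensor,     # [B, D] embeddings of the generated queries
    positive_emb: torch.Tensor,  # [B, D] embeddings of the LLM-relabeled positives
    hard_neg_emb: torch.Tensor,  # [B, D] embeddings of the LLM-mined hard negatives
    temperature: float = 0.05,   # assumed value, not taken from the paper
) -> torch.Tensor:
    """Cross-entropy over each query's similarity to its own positive versus
    the other in-batch positives and its hard negative."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(positive_emb, dim=-1)
    n = F.normalize(hard_neg_emb, dim=-1)

    pos_scores = q @ p.T / temperature                             # [B, B]: all in-batch positives
    neg_scores = (q * n).sum(dim=-1, keepdim=True) / temperature   # [B, 1]: own hard negative

    logits = torch.cat([pos_scores, neg_scores], dim=1)            # [B, B + 1]
    labels = torch.arange(q.size(0), device=q.device)              # diagonal = correct positive
    return F.cross_entropy(logits, labels)
```

Minimizing this cross-entropy pushes each query embedding toward its LLM-relabeled positive and away from both its hard negative and the other passages in the batch, which is exactly the goal of distinguishing positive passages from hard negative passages stated above.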