7 Feb 2022 | Sebastian Borgeaud†, Arthur Mensch†, Jordan Hoffmann†, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Pagani, Geoffrey Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan, Jack W. Rae†, Erich Elsen† and Laurent Sifre†,‡
The paper introduces the Retrieval-Enhanced Transformer (RETRO), a method that enhances auto-regressive language models by conditioning on document chunks retrieved from a large corpus, based on local similarity with preceding tokens. Using a 2-trillion-token retrieval database, RETRO achieves performance comparable to GPT-3 and Jurassic-1 on the Pile, despite using 25 times fewer parameters. RETRO combines a frozen BERT retriever, a differentiable encoder, and a chunked cross-attention mechanism to predict tokens conditioned on far more data than is typically consumed during training. The method scales well with both model size and database size, and can be fine-tuned for downstream tasks such as question answering. The paper also examines the issue of test-set leakage between the training and evaluation data and proposes an evaluation methodology that accounts for it. Overall, RETRO demonstrates that semi-parametric approaches provide an efficient and orthogonal way to enhance language models at unprecedented scale.
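To make the chunked cross-attention idea concrete, here is a minimal, illustrative sketch in PyTorch, not the authors' implementation: the input sequence is split into chunks, and each chunk attends to the (already encoded) neighbours retrieved for it. The shapes, chunk size, the random "neighbours" tensor, and the omission of RETRO's causal shift between chunks are all simplifying assumptions for the example.

```python
# Hypothetical sketch of chunked cross-attention (simplified: no causal
# shift; retrieved neighbours are random stand-ins for encoder outputs).
import torch
import torch.nn as nn


class ChunkedCrossAttention(nn.Module):
    """Each chunk of the input attends only to the neighbours retrieved for it."""

    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, hidden: torch.Tensor, neighbours: torch.Tensor) -> torch.Tensor:
        # hidden:     [batch, n_chunks * chunk_len, d_model]
        # neighbours: [batch, n_chunks, k * retrieved_len, d_model]
        b, t, d = hidden.shape
        n_chunks = neighbours.shape[1]
        chunk_len = t // n_chunks

        # Fold chunks into the batch dimension so each chunk only sees
        # the neighbour encodings that were retrieved for that chunk.
        q = hidden.view(b * n_chunks, chunk_len, d)
        kv = neighbours.reshape(b * n_chunks, -1, d)
        out, _ = self.attn(q, kv, kv)
        return hidden + out.view(b, t, d)  # residual connection


if __name__ == "__main__":
    batch, n_chunks, chunk_len, k, retrieved_len, d = 2, 4, 16, 2, 32, 64
    hidden = torch.randn(batch, n_chunks * chunk_len, d)
    # In RETRO these would come from a frozen-BERT nearest-neighbour lookup
    # over the database, passed through the differentiable encoder.
    neighbours = torch.randn(batch, n_chunks, k * retrieved_len, d)
    out = ChunkedCrossAttention(d)(hidden, neighbours)
    print(out.shape)  # torch.Size([2, 64, 64])
```

Because retrieval is performed per chunk rather than per token, the cost of the cross-attention grows with the number of chunks, not the full sequence length, which is what lets the approach scale to very large retrieval databases.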