7 Feb 2022 | Sebastian Borgeaud†, Arthur Mensch†, Jordan Hoffmann†, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Pagani, Geoffrey Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan, Jack W. Rae†, Erich Elsen† and Laurent Sifre†,‡
The paper introduces the Retrieval-Enhanced Transformer (RETRO), a method that enhances auto-regressive language models by conditioning on document chunks retrieved from a large corpus, based on local similarity with preceding tokens. Using a 2-trillion-token retrieval database, RETRO achieves performance comparable to GPT-3 and Jurassic-1 on the Pile, despite using 25 times fewer parameters. RETRO combines a frozen BERT retriever, a differentiable encoder, and a chunked cross-attention mechanism to predict tokens conditioned on far more data than is typically consumed during training. The method scales well with both model size and database size, and can be fine-tuned for downstream tasks such as question answering. The paper also examines the issue of test-set leakage between the training and evaluation data and proposes an evaluation methodology that accounts for it. Overall, RETRO demonstrates that semi-parametric approaches provide an efficient and orthogonal way to enhance language models at unprecedented scale.
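To make the chunked cross-attention idea concrete, here is a minimal, illustrative sketch in PyTorch, not the authors' implementation: the input sequence is split into chunks, and each chunk attends to the (already encoded) neighbours retrieved for it. The shapes, chunk size, the random "neighbours" tensor, and the omission of RETRO's causal shift between chunks are all simplifying assumptions for the example.

```python
# Hypothetical sketch of chunked cross-attention (simplified: no causal
# shift; retrieved neighbours are random stand-ins for encoder outputs).
import torch
import torch.nn as nn


class ChunkedCrossAttention(nn.Module):
    """Each chunk of the input attends only to the neighbours retrieved for it."""

    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, hidden: torch.Tensor, neighbours: torch.Tensor) -> torch.Tensor:
        # hidden:     [batch, n_chunks * chunk_len, d_model]
        # neighbours: [batch, n_chunks, k * retrieved_len, d_model]
        b, t, d = hidden.shape
        n_chunks = neighbours.shape[1]
        chunk_len = t // n_chunks

        # Fold chunks into the batch dimension so each chunk only sees
        # the neighbour encodings that were retrieved for that chunk.
        q = hidden.view(b * n_chunks, chunk_len, d)
        kv = neighbours.reshape(b * n_chunks, -1, d)
        out, _ = self.attn(q, kv, kv)
        return hidden + out.view(b, t, d)  # residual connection


if __name__ == "__main__":
    batch, n_chunks, chunk_len, k, retrieved_len, d = 2, 4, 16, 2, 32, 64
    hidden = torch.randn(batch, n_chunks * chunk_len, d)
    # In RETRO these would come from a frozen-BERT nearest-neighbour lookup
    # over the database, passed through the differentiable encoder.
    neighbours = torch.randn(batch, n_chunks, k * retrieved_len, d)
    out = ChunkedCrossAttention(d)(hidden, neighbours)
    print(out.shape)  # torch.Size([2, 64, 64])
```

Because retrieval is performed per chunk rather than per token, the cost of the cross-attention grows with the number of chunks, not the full sequence length, which is what lets the approach scale to very large retrieval databases.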