7 Feb 2022 | Sebastian Borgeaud†, Arthur Mensch†, Jordan Hoffmann†, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Pagani, Geoffrey Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan, Jack W. Rae†, Erich Elsen† and Laurent Sifre†,‡
This paper introduces RETRO, a retrieval-enhanced autoregressive language model that improves performance by conditioning on document chunks retrieved from a large corpus based on local similarity with preceding tokens. RETRO combines a frozen BERT retriever, a differentiable encoder, and a chunked cross-attention mechanism to predict tokens based on an order of magnitude more data than is typically consumed during training. With a 2 trillion token database, RETRO achieves performance comparable to GPT-3 and Jurassic-1 on the Pile, despite using 25× fewer parameters. After fine-tuning, RETRO also performs well on downstream tasks such as question answering.

RETRO can be trained from scratch, or pre-trained transformers can be rapidly retrofitted with retrieval and still achieve good performance. RETRO models are flexible: they can be evaluated without retrieval while still matching baseline models, and, conversely, baseline models can be rapidly fine-tuned into RETRO models that perform nearly as well as if they had been trained with retrieval from scratch. Careful analysis shows that only a modest fraction of RETRO's gains are due to test set leakage. The paper also discusses the privacy, safety, and fairness concerns raised by large-scale retrieval databases. Overall, the work demonstrates that explicit memory at unprecedented scale can improve language models, and that such semi-parametric approaches offer an orthogonal, more efficient path to more powerful models than raw parameter scaling.
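The retrieval step is simple to sketch at a high level: the input sequence is split into fixed-size chunks, each chunk is embedded with a frozen BERT-style encoder, and nearest neighbours are looked up in a precomputed database keyed by chunk embeddings; the retrieved chunks then condition generation through chunked cross-attention. Below is a minimal, illustrative sketch of the chunk-level lookup only, assuming a toy random-projection embedder in place of frozen BERT and brute-force L2 search in place of the paper's approximate nearest-neighbour index; all names, sizes, and helpers here are hypothetical, not the paper's implementation.

```python
import numpy as np

# Hypothetical stand-in for the frozen BERT encoder used by RETRO's retriever:
# it maps a chunk of token ids to a fixed-size vector via a random projection
# averaged over tokens, purely for illustration.
def embed_chunk(chunk_token_ids, projection):
    one_hot = np.eye(projection.shape[0])[chunk_token_ids]   # (chunk_len, vocab)
    return one_hot.mean(axis=0) @ projection                 # (d_model,)

def retrieve_neighbours(chunk_token_ids, db_keys, db_values, projection, k=2):
    """Return the k database chunks whose embeddings are closest (L2) to the query chunk."""
    query = embed_chunk(chunk_token_ids, projection)
    dists = np.linalg.norm(db_keys - query, axis=1)
    nearest = np.argsort(dists)[:k]
    return [db_values[i] for i in nearest]

# Toy setup: a vocabulary of 100 tokens, 16-dim embeddings, and a database of
# 1,000 pre-embedded chunks (keys) stored alongside their raw token ids (values).
rng = np.random.default_rng(0)
vocab_size, d_model, chunk_len = 100, 16, 8
projection = rng.normal(size=(vocab_size, d_model))

db_values = rng.integers(0, vocab_size, size=(1000, chunk_len))
db_keys = np.stack([embed_chunk(v, projection) for v in db_values])

# Split the input into chunks and retrieve neighbours for each chunk; in RETRO
# these neighbours would condition the decoder via chunked cross-attention.
input_ids = rng.integers(0, vocab_size, size=32)
for chunk in input_ids.reshape(-1, chunk_len):
    neighbours = retrieve_neighbours(chunk, db_keys, db_values, projection, k=2)
    print(f"retrieved {len(neighbours)} neighbours for chunk starting with {chunk[:3]}")
```

In RETRO itself the database stores neighbour chunks together with their continuations, keyed by frozen BERT embeddings, and retrieval happens once per 64-token chunk rather than per token, which keeps lookup cost manageable at trillion-token scale.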