REALM: Retrieval-Augmented Language Model Pre-Training

10 Feb 2020 | Kelvin Guu*, Kenton Lee*, Zora Tung, Panupong Pasupat, Ming-Wei Chang
REALM is a retrieval-augmented language model pre-training framework that enhances the ability of language models to capture and utilize world knowledge. Unlike traditional pre-training methods that store knowledge implicitly in neural network parameters, REALM introduces a latent knowledge retriever that allows the model to retrieve and attend to documents from a large corpus, such as Wikipedia, during pre-training, fine-tuning, and inference. This retriever is trained in an unsupervised manner using masked language modeling as the learning signal, with backpropagation through a retrieval step that considers millions of documents. The key idea is to train the retriever using performance-based signals from unsupervised text, rewarding retrievals that improve the language model's perplexity and penalizing uninformative ones.

REALM outperforms previous methods on three popular open-domain question answering (Open-QA) benchmarks by 4-16% in absolute accuracy, and it provides qualitative benefits such as interpretability and modularity.

The framework includes a neural knowledge retriever that models the probability of retrieving documents based on inner product similarity, and a knowledge-augmented encoder that uses retrieved documents to inform predictions. The model is trained to maximize the likelihood of the generative process, involving both retrieval and prediction steps. The approach addresses computational challenges by using asynchronous maximum inner product search (MIPS) for efficient retrieval. The retriever is initialized using the inverse cloze task, and the model is pre-trained on a large corpus, with fine-tuning on Open-QA tasks.

REALM's pre-training and fine-tuning processes are evaluated on benchmarks such as NaturalQuestions-Open, WebQuestions, and CuratedTrec, demonstrating significant improvements over existing methods.
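The retrieval and prediction steps described above can be sketched numerically. The following toy example (not the paper's actual BERT-based implementation; the embeddings and prediction probabilities here are made-up stand-ins) shows how the retrieval distribution arises from inner products and how the prediction is marginalized over retrieved documents:

```python
import numpy as np

def softmax(scores):
    # Numerically stable softmax over retrieval scores.
    e = np.exp(scores - scores.max())
    return e / e.sum()

# Toy stand-ins for the embedding functions of the query and the documents;
# dimensions and values are illustrative, not from the paper.
rng = np.random.default_rng(0)
embed_x = rng.standard_normal(4)       # query embedding for input x
embed_z = rng.standard_normal((5, 4))  # embeddings for 5 candidate documents z

# Retrieval distribution: p(z | x) is a softmax over inner products.
p_z_given_x = softmax(embed_z @ embed_x)

# Hypothetical per-document prediction probabilities p(y | z, x) from the
# knowledge-augmented encoder (fixed numbers here for illustration).
p_y_given_zx = np.array([0.9, 0.1, 0.2, 0.05, 0.3])

# Marginal likelihood of the generative process:
# p(y | x) = sum over z of p(y | z, x) * p(z | x)
p_y_given_x = float(p_y_given_zx @ p_z_given_x)
```

In the full model this sum runs over millions of documents, which is why REALM restricts it to the top documents found by asynchronous MIPS rather than computing the softmax exhaustively.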
The framework is also shown to be effective in scenarios where the pre-training and knowledge corpora differ, and it can adapt to new knowledge by updating the corpus. The retriever's utility is measured by the difference in log-likelihood when conditioning on retrieved documents.
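The log-likelihood difference used to measure a retrieved document's utility can be sketched directly. In this minimal illustration (the probability values are hypothetical), a document is useful when conditioning on it makes the target more likely than conditioning on no retrieval at all:

```python
import math

def retrieval_utility(p_y_given_z, p_y_given_null):
    # Utility of a retrieved document z for predicting y: the gain in
    # log-likelihood from conditioning on z versus on no retrieval.
    return math.log(p_y_given_z) - math.log(p_y_given_null)

# Hypothetical probabilities: the masked target is much more likely when
# the model conditions on an informative document.
helpful = retrieval_utility(0.6, 0.1)      # positive: retrieval helped
unhelpful = retrieval_utility(0.05, 0.1)   # negative: retrieval hurt
```

During pre-training this is exactly the kind of signal that rewards informative retrievals and penalizes uninformative ones.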