Scaling Retrieval-Based Language Models with a Trillion-Token Datastore


9 Jul 2024 | Rulin Shao, Jacqueline He, Akari Asai, Weijia Shi, Tim Dettmers, Sewon Min, Luke Zettlemoyer, Pang Wei Koh
This paper explores how the size of the datastore used at inference time affects the performance of retrieval-based language models (RBLMs). The authors construct MASSIVEDS, a 1.4 trillion-token datastore that is the largest and most diverse open-sourced datastore for RBLMs. They demonstrate that increasing the size of the datastore used by a retrieval-based LM monotonically improves language modeling and several downstream tasks, with a smaller model augmented with a large datastore outperforming a larger LM-only model on knowledge-intensive tasks.

By plotting compute-optimal scaling curves with varied datastore, model, and pretraining data sizes, they show that larger datastores can significantly improve model performance for the same training compute budget. They also analyze how improving the retriever, datastore quality filtering, and other design choices affect the observed scaling trends. Overall, their results show that datastore size should be considered an integral part of LM efficiency and performance trade-offs. The datastore and code are open-sourced at https://github.com/RulinShao/retrieval-scaling.

The study shows that retrieval-based LMs can outperform LM-only models at the same training cost, and that even weak language models benefit significantly from retrieval on knowledge-intensive tasks. Retrieval also helps on reasoning-intensive tasks with capable OLMO models, but not when the language model is insufficiently capable, as with PYTHIA. The paper concludes that increasing the scale of data available at inference time can improve model performance, at lower training cost, on language modeling and a variety of downstream tasks.
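To make the setup concrete, the sketch below illustrates the generic retrieve-then-read loop that a datastore like MASSIVEDS plugs into: passages are indexed offline, the top-k most similar passages are retrieved for each query, and they are prepended to the language model's prompt. The `embed` and `generate` functions are hypothetical placeholders (in practice they would wrap a dense retriever and an LM such as OLMO or PYTHIA); this is not the paper's actual implementation, which is available in the linked repository.

```python
import numpy as np

# Hypothetical placeholder: in practice this wraps a dense retriever encoder.
def embed(texts: list[str]) -> np.ndarray:
    """Return one (random, deterministic) embedding vector per text."""
    rng = np.random.default_rng(abs(hash(tuple(texts))) % (2**32))
    return rng.standard_normal((len(texts), 768)).astype(np.float32)

# Hypothetical placeholder: in practice this calls the reader language model.
def generate(prompt: str) -> str:
    return f"<answer conditioned on {len(prompt)} prompt characters>"

def retrieve(query: str, passages: list[str], passage_embs: np.ndarray, k: int = 3) -> list[str]:
    """Return the top-k datastore passages by inner-product similarity."""
    q = embed([query])[0]
    scores = passage_embs @ q
    top = np.argsort(-scores)[:k]
    return [passages[i] for i in top]

# The datastore is simply a large collection of text chunks indexed offline;
# the paper scales this collection to ~1.4 trillion tokens.
datastore = ["Passage about topic A ...", "Passage about topic B ...", "Passage about topic C ..."]
datastore_embs = embed(datastore)

query = "Which datastore does the paper introduce?"
context = "\n".join(retrieve(query, datastore, datastore_embs, k=2))
answer = generate(f"{context}\n\nQuestion: {query}\nAnswer:")
print(answer)
```

The key point of the paper is that the `datastore` list above can be grown at inference time, independently of training compute, and that growing it monotonically improves performance.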