Scaling Retrieval-Based Language Models with a Trillion-Token Datastore

9 Jul 2024 | Rulin Shao, Jacqueline He, Akari Asai, Weijia Shi, Tim Dettmers, Sewon Min, Luke Zettlemoyer, Pang Wei Koh
This paper explores the scaling of retrieval-based language models (LMs) by examining the impact of the amount of data available at inference time, specifically through the use of a large datastore. The authors construct a 1.4 trillion-token datastore named MassiveDS, the largest and most diverse open-sourced datastore for retrieval-based LMs to date. They find that increasing the size of the datastore monotonically improves language modeling and several downstream tasks, often allowing smaller retrieval-based models to outperform larger LM-only models on knowledge-intensive tasks. The study also shows that retrieval-based LMs achieve better compute-optimal scaling trends than LM-only models, suggesting that offloading FLOPs from pretraining to datastore construction can enhance performance. The authors further analyze the effects of data composition, reranking, and datastore filtering, concluding that broad and diverse datastores improve performance across multiple domains. Overall, the paper argues that datastore size should be treated as an integral part of LM efficiency and performance trade-offs.
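The summary above describes the retrieve-then-read setup only at a high level. Below is a minimal sketch, not the paper's implementation, of how a retrieval-based LM augments its input with passages retrieved from a datastore at inference time: passages are embedded offline, and at query time the top-k nearest passages (by inner product) are prepended to the prompt. The `embed` function is a random-vector stand-in for a real dense retriever, and all names here are illustrative.

```python
import numpy as np

def embed(texts, dim=768):
    """Placeholder for a dense retriever's text encoder.
    A real system would use a pretrained embedding model here."""
    rng = np.random.default_rng(abs(hash(tuple(texts))) % (2**32))
    return rng.standard_normal((len(texts), dim)).astype(np.float32)

class Datastore:
    """Flat inner-product index over datastore passages (built offline)."""
    def __init__(self, passages):
        self.passages = passages
        embs = embed(passages)
        # Normalize so inner product equals cosine similarity.
        self.embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)

    def search(self, query, k=3):
        q = embed([query])[0]
        q /= np.linalg.norm(q)
        scores = self.embs @ q               # similarity to every passage
        top = np.argsort(-scores)[:k]        # indices of the k best matches
        return [self.passages[i] for i in top]

def retrieval_augmented_prompt(query, datastore, k=3):
    """Prepend the top-k retrieved passages to the query before
    handing the combined prompt to the language model."""
    context = "\n".join(datastore.search(query, k))
    return f"{context}\n\nQuestion: {query}\nAnswer:"
```

In this setup, scaling the datastore means adding more passages to the index rather than adding parameters or pretraining compute to the LM, which is the trade-off the paper studies.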