July 1, 2024 | Hongkang Yang, Zehao Lin, Wenjin Wang, Hao Wu, Zhiyu Li, Bo Tang, Wenqiang Wei, Jinbo Wang, Zeyun Tang, Shichao Song, Chenyang Xi, Yu Yu, Kai Chen, Feiyu Xiong, Linpeng Tang, and Weinan E
The paper introduces Memory³, a novel architecture for large language models (LLMs) that incorporates explicit memory to reduce the cost of training and inference. Inspired by the memory hierarchy of the human brain, Memory³ externalizes most of its knowledge into explicit memories stored as sparse attention key-values, which reduces the model's parameter size, training cost, and inference cost. The model is trained from scratch with 2.4 billion parameters and outperforms larger LLMs and retrieval-augmented generation (RAG) models while maintaining a higher decoding speed. The paper also introduces a memory circuitry theory to justify the externalization of knowledge and presents supporting techniques such as memory sparsification and a two-stage pretraining scheme. The explicit memory mechanism enables the LLM to handle long contexts more efficiently, improve factuality, and reduce hallucination. The architecture is designed to be general-purpose, allowing most existing Transformer-based LLMs to accommodate explicit memories with minimal fine-tuning.
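To make the mechanism concrete, the sketch below shows one plausible way a decoder layer could attend jointly over retrieved explicit-memory key-values and the ordinary context key-values. This is not the paper's implementation; the function name, shapes, and the simplified masking are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def attention_with_explicit_memory(q, k_ctx, v_ctx, k_mem, v_mem):
    """Hypothetical sketch: attend over context key-values concatenated with
    retrieved explicit-memory key-values. Tensors use the
    (batch, heads, seq, head_dim) convention."""
    # Prepend the retrieved memory key-values so each query token can read
    # from both the external memories and the local context.
    k = torch.cat([k_mem, k_ctx], dim=2)
    v = torch.cat([v_mem, v_ctx], dim=2)
    # For simplicity this sketch omits causal masking and lets every query
    # see all memory tokens and all context tokens.
    return F.scaled_dot_product_attention(q, k, v)

# Toy usage: 2 retrieved memory tokens and 5 context tokens per head.
B, H, D = 1, 4, 32
q     = torch.randn(B, H, 5, D)
k_ctx = torch.randn(B, H, 5, D)
v_ctx = torch.randn(B, H, 5, D)
k_mem = torch.randn(B, H, 2, D)   # sparse key-values loaded from the memory store
v_mem = torch.randn(B, H, 2, D)
out = attention_with_explicit_memory(q, k_ctx, v_ctx, k_mem, v_mem)
print(out.shape)  # torch.Size([1, 4, 5, 32])
```

Because the memory key-values are precomputed and kept sparse, they can be fetched from storage at inference time rather than regenerated, which is the source of the cost savings the paper emphasizes.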