2024-03-01 | Soham De, Samuel L. Smith, Anushan Fernando, Aleksandar Botev, George-Cristian Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, Guillaume Desjardins, Arnaud Doucet, David Budden, Yee Whye Teh, Razvan Pascanu, Nando de Freitas and Caglar Gulcehre
This paper introduces Hawk and Griffin, two new recurrent models that combine gated linear recurrences with local attention to improve efficiency and performance in language modeling. Hawk is a pure recurrent model that uses a novel gated linear recurrent unit (RG-LRU) as its temporal mixing block, while Griffin is a hybrid model that interleaves the same recurrent blocks with local attention. Both models are competitive with or better than strong baselines while being markedly more efficient at inference.
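As a rough illustration of the temporal mixing step, here is a minimal plain-JAX sketch of the RG-LRU recurrence described in the paper: two sigmoid gates computed from the input, a learnable decay raised to a gated power, and a normalised element-wise state update. Parameter names, shapes, and the random initialisation below are placeholders for illustration, and the surrounding recurrent block (convolution, gating branch, projections) is omitted; this is not the authors' released implementation.

```python
import jax
import jax.numpy as jnp

def rg_lru(x, Wa, ba, Wx, bx, lam, c=8.0):
    """Sketch of the RG-LRU recurrence. x: (seq_len, dim)."""
    r = jax.nn.sigmoid(x @ Wa + ba)              # recurrence gate r_t
    i = jax.nn.sigmoid(x @ Wx + bx)              # input gate i_t
    # a_t = sigmoid(lam) ** (c * r_t), computed in log space for stability
    a = jnp.exp(-c * r * jax.nn.softplus(-lam))
    gated_x = i * x

    def step(h, inputs):
        a_t, gx_t = inputs
        h = a_t * h + jnp.sqrt(1.0 - a_t**2) * gx_t   # normalised state update
        return h, h

    _, hs = jax.lax.scan(step, jnp.zeros(x.shape[-1]), (a, gated_x))
    return hs

# Toy usage with random placeholder parameters.
key = jax.random.PRNGKey(0)
seq_len, dim = 16, 8
ks = jax.random.split(key, 4)
x = jax.random.normal(ks[0], (seq_len, dim))
Wa = jax.random.normal(ks[1], (dim, dim)) * 0.1
Wx = jax.random.normal(ks[2], (dim, dim)) * 0.1
ba, bx = jnp.zeros(dim), jnp.zeros(dim)
lam = jax.random.normal(ks[3], (dim,)) + 4.0   # biases sigmoid(lam) towards 1 (slow decay)
print(rg_lru(x, Wa, ba, Wx, bx, lam).shape)    # (16, 8)
```

The sqrt(1 - a_t^2) factor normalises the gated input so the hidden state stays bounded, one of the modifications the paper makes relative to earlier linear recurrent units.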
Hawk and Griffin exhibit power-law scaling of held-out loss with training FLOPs, up to 7B parameters, and Griffin achieves lower held-out loss than strong Transformer baselines at all model scales. Despite being trained on substantially fewer tokens, Griffin matches the performance of Llama-2, and Hawk exceeds the reported performance of Mamba on downstream tasks. Both models match the training efficiency of Transformers on TPUs while achieving significantly higher inference throughput and lower latency.
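"Power-law scaling" here means the held-out loss is well approximated by a power law in training compute; schematically (the paper's fitted constants are not reproduced here):

```latex
L(C) \approx A \, C^{-\alpha}, \qquad C = \text{training FLOPs}, \quad A, \alpha > 0
```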
Griffin can extrapolate to sequences significantly longer than those seen during training and efficiently learns copying and retrieval tasks when trained on them. Without fine-tuning, however, it performs less well than Transformers on copying and exact-retrieval tasks. The RG-LRU layer, central to both models, uses an element-wise (diagonal) linear recurrence whose computation is memory-bound on device, and the overall block design supports efficient model-parallel distributed training.
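The local attention mentioned above is causal sliding-window attention, which keeps the key/value cache bounded regardless of sequence length; together with the fixed-size recurrent state, this is what keeps long-sequence inference cheap. A small sketch of the corresponding mask is below; the window size is illustrative, not the paper's setting.

```python
import jax.numpy as jnp

def local_causal_mask(seq_len: int, window: int) -> jnp.ndarray:
    """True where query position i may attend to key position j."""
    i = jnp.arange(seq_len)[:, None]
    j = jnp.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)   # causal and within the sliding window

print(local_causal_mask(6, 3).astype(int))
```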
The models are trained on the MassiveText dataset and evaluated on a range of downstream tasks, where Griffin achieves strong performance on MMLU, HellaSwag, PIQA, ARC-E, ARC-C, and WinoGrande. Both models are efficient to train and serve, with Griffin showing particularly high throughput and low latency when sampling long sequences.
The paper also discusses the challenges of training large recurrent models, including efficient model parallelism and the implementation of linear recurrences on TPUs: because the recurrence is memory-bound, the authors write a custom Pallas kernel for it that minimizes memory transfers and improves performance.
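For reference, the core computation the kernel targets is an element-wise linear recurrence of the form h_t = a_t * h_{t-1} + b_t. In plain JAX this can also be expressed as a parallel (associative) scan, as in the stand-in below; this is only to show the computation pattern and is not the paper's approach, which instead implements a linear (sequential) scan as a custom Pallas kernel to minimise memory transfers.

```python
import jax
import jax.numpy as jnp

def combine(left, right):
    """Associative operator composing the affine maps h -> a*h + b."""
    a_l, b_l = left
    a_r, b_r = right
    return a_r * a_l, a_r * b_l + b_r

def linear_recurrence(a, b):
    """a, b: (seq_len, dim) decay gates and inputs. Returns h_t for all t, with h_0 = 0."""
    _, h = jax.lax.associative_scan(combine, (a, b), axis=0)
    return h

# Toy usage with random gates and inputs.
a = jax.nn.sigmoid(jax.random.normal(jax.random.PRNGKey(1), (5, 4)))
b = jax.random.normal(jax.random.PRNGKey(2), (5, 4))
print(linear_recurrence(a, b).shape)  # (5, 4)
```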
Overall, the paper shows that recurrent models like Hawk and Griffin can match or exceed the performance of Transformers while matching their training efficiency and being significantly more efficient at inference. The RG-LRU layer and the combination of recurrent blocks with local attention are the key innovations that enable these improvements.