2024-03-01 | Soham De*1, Samuel L. Smith*1, Anushan Fernando*1, Aleksandar Botev*1, George-Cristian Muraru*1, Albert Gu2, Ruba Haroun1, Leonard Berrada1, Yutian Chen1, Srivatsan Srinivasan1, Guillaume Desjardins1, Arnaud Doucet1, David Budden1, Yee Whye Teh1, Razvan Pascanu1, Nando de Freitas1 and Caglar Gulcehre1
The paper "Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models" by Soham De et al. introduces two novel models, Hawk and Griffin, designed to address the limitations of Recurrent Neural Networks (RNNs) in terms of training efficiency and scalability. RNNs, while effective for long sequences, are challenging to train and scale due to their quadratic complexity and memory requirements. The authors propose a gated linear recurrent layer, the Real-Gated Linear Recurrent Unit (RG-LRU), which is the core component of both models.
**Hawk** is a pure recurrent model that interleaves gated linear recurrences (RG-LRU blocks) with MLPs. It exceeds the reported downstream performance of Mamba, a recent state-space model, at comparable model sizes. **Griffin** is a hybrid model that mixes gated linear recurrences with local attention, and it matches the performance of the Transformer-based Llama-2 despite being trained on over six times fewer tokens.
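As a structural illustration, here is a sketch of how Griffin's temporal-mixing layers could be laid out. The block names are assumptions, and the 2:1 interleaving (two recurrent residual blocks followed by one local multi-query-attention block) follows our reading of the paper; treat the exact ratio as part of the sketch.

```python
def griffin_block_pattern(n_layers: int) -> list[str]:
    """Hypothetical helper: Griffin alternates temporal-mixing blocks,
    using the RG-LRU recurrent block for most layers and local
    (sliding-window) multi-query attention for every third layer.
    Hawk would use the recurrent block in every layer instead."""
    pattern = []
    for i in range(n_layers):
        if i % 3 == 2:
            pattern.append("local_mqa_attention")   # fixed-window attention block
        else:
            pattern.append("rg_lru_recurrence")     # gated linear recurrence block
    return pattern

print(griffin_block_pattern(6))
# ['rg_lru_recurrence', 'rg_lru_recurrence', 'local_mqa_attention',
#  'rg_lru_recurrence', 'rg_lru_recurrence', 'local_mqa_attention']
```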
The paper highlights several key contributions:
1. **Power-Law Scaling**: Both Hawk and Griffin exhibit power-law scaling between held-out loss and training FLOPs, as previously observed for Transformers (a sketch of this relation follows the list).
2. **Performance**: Griffin achieves lower held-out loss than a strong Transformer baseline at every model scale evaluated.
3. **Training Efficiency**: Both models match the hardware efficiency of Transformers during training.
4. **Inference Efficiency**: Both models achieve lower latency and significantly higher throughput than Transformers during inference, since their recurrence state and local-attention cache are fixed in size, whereas a Transformer's key-value cache grows with sequence length.
5. **Long-Sequence Extrapolation**: Griffin extrapolates to sequences significantly longer than those seen during training.
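For reference, the power-law relation referred to in the first item typically takes the following form, where $L$ is held-out loss, $C$ is training FLOPs, and $a, b > 0$ are fitted constants (the specific fitted values are not reproduced here):

$$ L(C) \approx a \cdot C^{-b} $$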
The authors also discuss the challenges of training and serving recurrent models efficiently, including model parallelism and the efficient implementation of linear recurrences on accelerators such as TPUs. Detailed experiments and comparisons with other models demonstrate the effectiveness of Hawk and Griffin across model scales, downstream tasks, and inference settings.