RecurrentGemma: Moving Past Transformers for Efficient Open Language Models

2024-04-12 | Griffin, RLHF and Gemma Teams
RecurrentGemma is an open language model developed by Google DeepMind and built on the Griffin architecture. Griffin avoids global attention, instead combining linear recurrences with local attention, and achieves strong performance on downstream language tasks, competitive with Gemma-2B, an open transformer model. Unlike a transformer, whose key-value (KV) cache grows with sequence length during inference and can become memory-intensive, RecurrentGemma compresses the input sequence into a fixed-size state, which reduces memory use and enables efficient inference on long sequences. Two checkpoints are released: a pre-trained model with 2B non-embedding parameters and an instruction-tuned variant, both achieving performance comparable to Gemma-2B despite being trained on fewer tokens.
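The fixed-size state is the central architectural point here. As a minimal, illustrative sketch (with assumed shapes and a simplified gate, not the actual Griffin/RG-LRU layer), the JAX snippet below runs a gated linear recurrence with jax.lax.scan: the carried state keeps the same width no matter how long the sequence is, whereas a transformer KV cache stores keys and values for every token processed.

```python
# Minimal sketch of a gated linear recurrence, assuming a simplified update
# rule h_t = a_t * h_{t-1} + (1 - a_t) * x_t. This is NOT the Griffin/RG-LRU
# implementation; it only illustrates that the carried state has a fixed size,
# independent of sequence length.
import jax
import jax.numpy as jnp


def linear_recurrence(x, a):
    """x, a: arrays of shape (seq_len, width); a holds gates in (0, 1)."""

    def step(h_prev, inputs):
        x_t, a_t = inputs
        h_t = a_t * h_prev + (1.0 - a_t) * x_t  # state width never grows
        return h_t, h_t

    h0 = jnp.zeros(x.shape[-1])
    final_state, all_states = jax.lax.scan(step, h0, (x, a))
    return final_state, all_states


key_x, key_a = jax.random.split(jax.random.PRNGKey(0))
seq_len, width = 8192, 2560  # illustrative sizes, not the model configuration
x = jax.random.normal(key_x, (seq_len, width))
a = jax.nn.sigmoid(jax.random.normal(key_a, (seq_len, width)))

final_state, _ = linear_recurrence(x, a)
print(final_state.shape)  # (2560,): same size whether seq_len is 1k or 1M
```

During decoding, each new token only updates this state in place, which is why memory stays flat as generation proceeds instead of growing with every generated token.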
RecurrentGemma achieves significantly faster inference than Gemma-2B, particularly on long sequences. The release includes a pre-trained checkpoint and an instruction-tuned checkpoint fine-tuned for instruction following and dialogue, efficient JAX code for evaluation and fine-tuning (including a specialized Pallas kernel for TPUs), and a PyTorch implementation.

RecurrentGemma-2B is trained on 2T tokens using the same pre-training data as Gemma-2B, consisting primarily of English text from web documents, mathematics, and code. Training uses a large general data mixture followed by a smaller, higher-quality dataset, and the model uses the same SentencePiece tokenizer as Gemma, with a 256k-token vocabulary. Instruction tuning and RLHF are then applied to adapt the model for instruction following and dialogue.

The model is evaluated across a broad range of domains with automated benchmarks and human evaluation. RecurrentGemma-2B performs comparably to Gemma-2B on automated benchmarks and competitively against the much larger Mistral 7B model in human evaluation. Because its state is significantly smaller than a transformer's KV cache on long sequences, it sustains higher throughput at every sequence length considered, and unlike Gemma its throughput does not drop as sequences grow. The release is accompanied by safety evaluations and recommendations for responsible deployment. In short, RecurrentGemma-2B matches Gemma's performance while delivering higher inference throughput, especially on long sequences.
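To make the throughput claim above concrete, here is a back-of-envelope memory comparison. The layer count, head dimension, local-attention window, and state width below are illustrative assumptions, not the published Gemma-2B or RecurrentGemma-2B configurations; the structural point is that KV-cache memory grows linearly with sequence length while a fixed state (recurrence state plus a bounded local-attention cache) does not.

```python
# Back-of-envelope inference-memory comparison. All configuration numbers are
# illustrative assumptions for the sake of the calculation, not the published
# Gemma-2B / RecurrentGemma-2B configurations.

def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Global-attention KV cache: K and V stored for every past token."""
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_elem


def fixed_state_bytes(n_layers, state_width, window, n_kv_heads, head_dim,
                      bytes_per_elem=2):
    """Recurrent state plus a local-attention cache bounded by the window size."""
    recurrent = n_layers * state_width
    local_kv = 2 * n_layers * window * n_kv_heads * head_dim
    return (recurrent + local_kv) * bytes_per_elem


for seq_len in (2_048, 8_192, 32_768, 131_072):
    kv = kv_cache_bytes(seq_len, n_layers=24, n_kv_heads=1, head_dim=256)
    fixed = fixed_state_bytes(n_layers=24, state_width=2_560, window=2_048,
                              n_kv_heads=1, head_dim=256)
    print(f"{seq_len:>7} tokens: KV cache ~{kv / 2**20:7.1f} MiB, "
          f"fixed state ~{fixed / 2**20:7.1f} MiB")
```

With these assumed numbers the KV cache grows from tens of MiB at 2k tokens to several GiB at 128k tokens, while the fixed state stays constant, which is why throughput can remain flat as sequence length increases.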