Zamba: A Compact 7B SSM Hybrid Model

26 May 2024 | Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Adam Ibrahim, Beren Millidge
Zamba is a 7B-parameter state-space model (SSM) hybrid that achieves performance competitive with leading open-source models. It was trained on 1T tokens from open datasets and is the best non-transformer model at this scale. Zamba combines a Mamba backbone with a single shared attention module, gaining the benefits of attention at minimal parameter cost, and it is significantly faster at inference and requires less memory for long-sequence generation than comparable transformer models. Zamba was pretrained in two phases: first on web datasets, then on high-quality instruct and synthetic data with a rapid learning-rate decay. The model is open-sourced, including all checkpoints from both phases.

Zamba's architecture is inspired by the brain's cortex and hippocampus, where different layers share a common memory store, allowing efficient performance with minimal memory usage. Combining Mamba blocks with a single global shared self-attention layer merges the in-context-learning strengths of transformers with Mamba's inference efficiency, and the shared layer adds only a small, constant parameter cost. Zamba was trained on a relatively small budget of roughly $200k by a team of 7 researchers over about a month, yet reaches performance comparable to leading models.

Zamba outperforms models such as Llama 2 and Pythia, and performs well on general language-modeling and reasoning benchmarks despite being trained on fewer, and potentially lower-quality, tokens. Its inference and generation efficiency is significantly better than that of comparable models, with faster forward passes and reduced memory usage for KV caching. The two-phase training approach, a pretraining phase followed by an annealing phase on high-quality data, significantly improves model performance. Performance on benchmarks such as MMLU and ARC is competitive with leading models, although Zamba lags slightly on reasoning tasks. The open checkpoints from both phases allow further study of learning dynamics and architectural benefits, and the model's design and training approach offer insights into the potential of SSMs and hybrid models for efficient language modeling.
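To make the shared-attention idea concrete, here is a minimal PyTorch sketch of a Mamba-style backbone that reuses a single attention module every few blocks. The class names, block counts, and the stubbed Mamba mixer are illustrative assumptions, not the authors' implementation; causal masking and the exact placement of the shared block are omitted for brevity.

```python
# Minimal sketch of a Mamba-backbone-plus-shared-attention hybrid.
# All names and dimensions are hypothetical placeholders, not Zamba's actual code.
import torch
import torch.nn as nn


class MambaBlockStub(nn.Module):
    """Placeholder standing in for a real Mamba (SSM) block."""

    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mixer = nn.Linear(d_model, d_model)  # stand-in for the SSM mixer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.mixer(self.norm(x))


class SharedAttentionHybrid(nn.Module):
    """Stack of Mamba blocks with ONE attention module reused every k blocks.

    Because the attention parameters are shared, adding attention costs a
    small, constant number of parameters regardless of depth.
    """

    def __init__(self, d_model: int = 512, n_blocks: int = 12,
                 share_every: int = 6, n_heads: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList(MambaBlockStub(d_model) for _ in range(n_blocks))
        self.shared_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_norm = nn.LayerNorm(d_model)
        self.share_every = share_every

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for i, block in enumerate(self.blocks):
            if i % self.share_every == 0:
                # Same attention weights are applied at every shared call site.
                h = self.attn_norm(x)
                attn_out, _ = self.shared_attn(h, h, h, need_weights=False)
                x = x + attn_out
            x = block(x)
        return x


if __name__ == "__main__":
    model = SharedAttentionHybrid()
    tokens = torch.randn(2, 16, 512)  # (batch, sequence, d_model)
    print(model(tokens).shape)        # torch.Size([2, 16, 512])
```

The point of the sketch is the parameter accounting: deepening the Mamba stack adds blocks, but the attention module is instantiated once, so its cost stays constant however often it is invoked.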
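The two-phase recipe, ordinary pretraining on web data followed by a short annealing phase on high-quality instruct and synthetic data with a rapid learning-rate decay, can be pictured with a schedule like the one below. The step counts, decay shapes, and learning rates are made-up placeholders for illustration, not the paper's hyperparameters.

```python
# Illustrative two-phase schedule: a standard cosine decay over the web-data
# phase, then a rapid decay during the high-quality "annealing" phase.
# All numbers are hypothetical placeholders, not the paper's actual values.
import math


def two_phase_lr(step: int,
                 peak_lr: float = 3e-4,
                 phase1_steps: int = 900_000,
                 phase2_steps: int = 100_000,
                 final_lr: float = 1e-5) -> float:
    if step < phase1_steps:
        # Phase 1: cosine decay from peak_lr down to 10% of peak.
        phase1_end_lr = peak_lr * 0.1
        progress = step / phase1_steps
        cos = 0.5 * (1 + math.cos(math.pi * progress))
        return phase1_end_lr + (peak_lr - phase1_end_lr) * cos
    # Phase 2 (annealing): restart at the phase-1 end LR and decay
    # exponentially to final_lr over a much shorter horizon.
    start_lr = peak_lr * 0.1
    progress = min((step - phase1_steps) / phase2_steps, 1.0)
    return start_lr * (final_lr / start_lr) ** progress


if __name__ == "__main__":
    for s in (0, 450_000, 899_999, 900_000, 950_000, 1_000_000):
        print(f"step {s:>9}: lr = {two_phase_lr(s):.2e}")
```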
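The memory claim follows from simple arithmetic: a transformer must cache keys and values for every layer at every generated token, whereas an SSM carries a fixed-size recurrent state, and a hybrid like Zamba needs a cache only for its single shared attention layer. The sketch below compares the two extremes using generic 7B-class dimensions, which are illustrative assumptions rather than Zamba's actual configuration.

```python
# Back-of-the-envelope decoding-memory comparison: a transformer's KV cache
# grows linearly with sequence length, while an SSM keeps a fixed-size state.
# Dimensions are generic 7B-class guesses, not Zamba's configuration.

def transformer_kv_cache_bytes(seq_len: int,
                               n_layers: int = 32,
                               n_heads: int = 32,
                               head_dim: int = 128,
                               bytes_per_elem: int = 2) -> int:
    # Keys and values: 2 tensors per layer, each of shape (seq_len, n_heads, head_dim).
    return 2 * n_layers * seq_len * n_heads * head_dim * bytes_per_elem


def ssm_state_bytes(n_layers: int = 32,
                    d_model: int = 4096,
                    state_dim: int = 16,
                    bytes_per_elem: int = 2) -> int:
    # One fixed-size recurrent state per layer, independent of sequence length.
    return n_layers * d_model * state_dim * bytes_per_elem


if __name__ == "__main__":
    for seq_len in (2_048, 16_384, 131_072):
        kv = transformer_kv_cache_bytes(seq_len) / 2**30
        ssm = ssm_state_bytes() / 2**30
        print(f"seq_len={seq_len:>7}: transformer KV cache ~ {kv:6.2f} GiB, "
              f"SSM state ~ {ssm:.4f} GiB")
```

At 2k tokens the full-transformer cache is already about 1 GiB in fp16 and grows to tens of GiB at long contexts, while the SSM state stays constant; caching keys and values for only one shared layer scales that transformer figure down by roughly the layer count.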