SAMBA is a simple hybrid architecture that combines Mamba, a selective State Space Model (SSM), with Sliding Window Attention (SWA) to enable efficient language modeling over unlimited context. The model selectively compresses a sequence into recurrent hidden states while retaining the ability to recall recent tokens precisely through attention, achieving linear time complexity with unlimited length extrapolation.

Trained on sequences of 4K length, SAMBA shows improved perplexity at context lengths of up to 1M tokens in a zero-shot setting. When fine-tuned on 4K-length sequences, it extrapolates to a 256K context length with perfect memory recall on the Passkey Retrieval task while maintaining linear computational complexity, and it outperforms full-attention models on the Phonebook retrieval task. As a linear-time sequence model, SAMBA achieves 3.73× higher throughput in prompt processing than Llama-3 1.6B, a Transformer baseline with grouped-query attention, at a 128K prompt length, and a 3.64× speedup when generating 64K tokens with unlimited streaming. The model is publicly available at https://github.com/microsoft/Samba.

SAMBA is evaluated on a wide range of benchmarks, including commonsense reasoning, language understanding, truthfulness, and math and coding tasks. It outperforms state-of-the-art models on most tasks and achieves the best average performance. Its results on long-context tasks such as Passkey Retrieval and Phonebook highlight its effectiveness in handling long sequences.
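The core idea (compress the full history into a fixed-size recurrent state, then attend exactly over a recent window) can be illustrated with a minimal NumPy sketch. This is a hedged toy, not the actual Samba implementation: the exponential-moving-average recurrence stands in for Mamba's selective, input-dependent SSM, and the single-head attention omits projections, multiple heads, and the interleaved MLP layers of the real architecture.

```python
import numpy as np

def sliding_window_attention(x, window):
    """Causal attention where each token attends only to the last
    `window` tokens (including itself), so cost is O(T * window)
    rather than O(T^2). Single head, no projections, for clarity."""
    T, d = x.shape
    out = np.zeros_like(x)
    for t in range(T):
        lo = max(0, t - window + 1)
        q, k, v = x[t], x[lo:t + 1], x[lo:t + 1]
        scores = k @ q / np.sqrt(d)          # similarity to recent keys
        w = np.exp(scores - scores.max())    # stable softmax
        w /= w.sum()
        out[t] = w @ v                       # weighted recall of recent values
    return out

def linear_recurrence(x, decay=0.9):
    """Stand-in for Mamba's selective SSM: an exponential moving
    average that compresses the entire prefix into one fixed-size
    hidden state per channel. (The real SSM makes the dynamics
    input-dependent, i.e. 'selective'.)"""
    h = np.zeros(x.shape[1])
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        h = decay * h + (1 - decay) * x[t]   # O(1) state update per step
        out[t] = h
    return out

def hybrid_block(x, window=4):
    """Hybrid layer in the spirit of Samba: a recurrent summary of
    the full history, followed by exact attention over the recent
    window. Total cost stays linear in sequence length."""
    return sliding_window_attention(linear_recurrence(x), window)
```

Because both components do constant work per token once the window size is fixed, stacking them preserves the linear-time, constant-memory decoding that the throughput numbers above rely on.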
SAMBA's hybrid architecture combines the strengths of SSMs and attention mechanisms, leading to improved performance and efficiency in language modeling. Its ability to extrapolate memory recall to very long contexts underscores its practical applicability to real-world tasks requiring extensive context understanding.