20 May 2024 | Siavash Shams, Sukru Samet Dindar, Xilin Jiang, Nima Mesgarani
SSAMBA is a self-supervised, attention-free audio representation learning model built on state space models (SSMs). It leverages bidirectional Mamba to capture complex audio patterns effectively and is pretrained on large-scale, unlabeled datasets using a self-supervised framework that optimizes both discriminative and generative objectives. SSAMBA outperforms the Self-Supervised Audio Spectrogram Transformer (SSAST) on most tasks while being significantly more efficient: for an input of 22k tokens, the SSAMBA Tiny model is approximately 92.7% faster at inference and 95.4% more memory-efficient than its SSAST counterpart. The architecture is more efficient than transformers because its time and memory complexity grow subquadratically rather than quadratically with sequence length.

The model operates on audio spectrograms, which are split into patches and projected into embeddings. These patch embeddings are processed by a bidirectional Mamba encoder that captures global audio context by scanning the sequence in both directions.

SSAMBA is evaluated on tasks such as audio classification, keyword spotting, and speaker identification, achieving superior or comparable performance to SSAST while significantly reducing inference cost. The main contributions of the study are the introduction of SSAMBA, the implementation of three model sizes (Tiny, Small, and Base), and the demonstration of the model's efficiency and performance. Its robust performance stems from architectural innovations that capture complex audio patterns and a mixed pretraining strategy on AudioSet and LibriSpeech, which enhances generalization across diverse audio types.
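To make the patch-embedding and bidirectional-encoder description above more concrete, here is a minimal PyTorch sketch. The patch size, embedding width, the residual merge of the two scan directions, and the ToySSM stand-in block are illustrative assumptions rather than the authors' exact design; in the actual model the stand-in would be a Mamba selective-SSM block.

```python
import torch
import torch.nn as nn


class ToySSM(nn.Module):
    """Stand-in sequence block so the sketch runs without a Mamba dependency.

    In the real model this would be a Mamba selective-SSM block.
    """
    def __init__(self, dim):
        super().__init__()
        self.mix = nn.GRU(dim, dim, batch_first=True)

    def forward(self, x):                       # x: (batch, seq, dim)
        return self.mix(x)[0]


class PatchEmbed(nn.Module):
    """Splits a spectrogram into square patches and projects them to embeddings."""
    def __init__(self, dim=192, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)

    def forward(self, spec):                    # spec: (batch, 1, freq_bins, frames)
        x = self.proj(spec)                     # (batch, dim, freq_patches, time_patches)
        return x.flatten(2).transpose(1, 2)     # (batch, num_patches, dim)


class BiMambaLayer(nn.Module):
    """Runs two sequence blocks over the patch sequence, one per direction."""
    def __init__(self, dim, block_cls=ToySSM):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fwd = block_cls(dim)               # scans patches left to right
        self.bwd = block_cls(dim)               # scans patches right to left

    def forward(self, x):                       # x: (batch, num_patches, dim)
        h = self.norm(x)
        out_fwd = self.fwd(h)
        out_bwd = torch.flip(self.bwd(torch.flip(h, dims=[1])), dims=[1])
        return x + out_fwd + out_bwd            # residual merge of both directions


# Example: encode a batch of two 128-mel-bin, 1024-frame spectrograms.
spec = torch.randn(2, 1, 128, 1024)
encoder = nn.Sequential(PatchEmbed(dim=192),
                        *[BiMambaLayer(192) for _ in range(4)])
tokens = encoder(spec)                          # (2, 512, 192) patch representations
```

The point the sketch captures is that each layer scans the patch sequence both forward and backward, so every patch representation can incorporate context from the entire clip without any attention operation.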
SSAMBA's efficiency on resource-constrained devices suggests potential for broad real-world applications, from mobile and edge devices to large-scale cloud systems.
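The joint discriminative and generative pretraining objective mentioned above can be sketched in the same spirit. The masking ratio, the InfoNCE-style matching loss, the mean-squared-error reconstruction term, and the weighting factor below are assumptions chosen for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F


def masked_pretraining_loss(encoder, patches, mask_ratio=0.4, lam=10.0):
    """Joint discriminative + generative loss over masked spectrogram patches.

    `encoder` maps (batch, num_patches, dim) -> (batch, num_patches, dim).
    The masking ratio, loss forms, and weight `lam` are illustrative only.
    """
    B, N, D = patches.shape
    num_mask = int(mask_ratio * N)
    device = patches.device

    # Pick a random subset of patches to mask in each clip.
    idx = torch.rand(B, N, device=device).argsort(dim=1)[:, :num_mask]
    rows = torch.arange(B, device=device)[:, None]

    corrupted = patches.clone()
    corrupted[rows, idx] = 0.0                      # zero vector as the mask token

    hidden = encoder(corrupted)                     # (B, N, D)
    pred = hidden[rows, idx]                        # encoder outputs at masked slots
    target = patches[rows, idx]                     # the original masked patches

    # Generative objective: reconstruct the content of the masked patches.
    gen_loss = F.mse_loss(pred, target)

    # Discriminative objective: match each prediction to its own patch among
    # the other masked patches of the same clip (an InfoNCE-style loss).
    logits = torch.einsum("bmd,bnd->bmn", pred, target) / D ** 0.5
    labels = torch.arange(num_mask, device=device).expand(B, num_mask)
    disc_loss = F.cross_entropy(logits.reshape(-1, num_mask), labels.reshape(-1))

    return disc_loss + lam * gen_loss
```

Here `encoder` would be the bidirectional Mamba stack operating on patch embeddings (for instance, the layers from the previous sketch without the patch-embedding front end), trained so that the same backbone serves both the reconstruction and the matching objective before fine-tuning on downstream tasks.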