20 May 2024 | Siavash Shams, Sukru Samet Dindar, Xilin Jiang, Nima Mesgarani
This paper introduces SSAMBA, a self-supervised, attention-free audio representation learning model built on state space models (SSMs). Unlike transformer models, whose self-attention scales quadratically with sequence length in both GPU memory usage and inference time, SSAMBA leverages the bidirectional Mamba SSM to capture complex audio patterns efficiently. The model is trained with a self-supervised pretraining framework that optimizes both discriminative and generative objectives, enabling it to learn robust audio representations from large-scale unlabeled datasets.
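To make the SSM backbone concrete, the sketch below implements the basic discretized state-space recurrence that Mamba builds on, with input-dependent ("selective") parameters and a forward/backward pair of scans to suggest the bidirectional design. This is an illustrative toy under assumed shapes and parameterization, not the paper's implementation; all function names here are hypothetical.

```python
import numpy as np

def selective_ssm_scan(x, A, B, C, delta):
    """Minimal recurrent scan of a discretized state space model (one channel).

    x:     (T,)   scalar input sequence (e.g. one feature of a patch embedding)
    A:     (N,)   diagonal continuous-time state matrix (negative entries for stability)
    B, C:  (T, N) input-dependent projections -- the 'selective' part of Mamba
    delta: (T,)   input-dependent step sizes
    Returns y: (T,) output sequence.
    """
    T, N = B.shape
    h = np.zeros(N)
    y = np.empty(T)
    for t in range(T):
        # Zero-order-hold discretization of the diagonal system at step t:
        # A_bar = exp(delta*A), B_bar = (exp(delta*A) - 1) / A * B
        A_bar = np.exp(delta[t] * A)
        B_bar = (A_bar - 1.0) / A * B[t]
        h = A_bar * h + B_bar * x[t]   # state update
        y[t] = C[t] @ h                # readout
    return y

def bidirectional_ssm(x, params_fwd, params_bwd):
    """Run one scan forward and one over the time-reversed input, then combine,
    mirroring the bidirectional processing described for SSAMBA's encoder."""
    y_fwd = selective_ssm_scan(x, *params_fwd)
    y_bwd = selective_ssm_scan(x[::-1], *params_bwd)[::-1]
    return y_fwd + y_bwd
```

Because the recurrence touches each time step once, memory and compute grow linearly with sequence length, which is the source of the efficiency gains discussed below.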
SSAMBA is evaluated on a range of audio tasks, including audio classification, keyword spotting, and speaker identification. The results show that SSAMBA outperforms the Self-Supervised Audio Spectrogram Transformer (SSAST) on most tasks while being significantly more efficient: for the tiny model size with an input of 22k tokens, SSAMBA is approximately 92.7% faster in batch inference and 95.4% more memory-efficient than SSAST.
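Efficiency comparisons of this kind are typically made by timing batched forward passes and tracking peak GPU memory. The snippet below is one plausible way to take such measurements in PyTorch; it is a generic benchmarking sketch, not the authors' protocol, and `model` and `batch` stand in for any audio encoder and its input.

```python
import time
import torch

def benchmark(model, batch, n_warmup=10, n_runs=50, device="cuda"):
    """Rough batch-inference latency (seconds) and peak GPU memory (MiB)."""
    model = model.to(device).eval()
    batch = batch.to(device)
    with torch.no_grad():
        for _ in range(n_warmup):          # warm up kernels and the allocator
            model(batch)
        torch.cuda.synchronize()
        torch.cuda.reset_peak_memory_stats()
        start = time.perf_counter()
        for _ in range(n_runs):
            model(batch)
        torch.cuda.synchronize()
    latency = (time.perf_counter() - start) / n_runs
    peak_mem = torch.cuda.max_memory_allocated() / 2**20
    return latency, peak_mem
```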
The paper also explores the mathematical foundations of the Mamba model and the architecture of SSAMBA, detailing its input representation, linear projection, positional encoding, and bidirectional Mamba encoder. The self-supervised learning framework is described, including the use of masked spectrogram patches and the integration of discriminative and generative objectives.
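The joint pretraining objective can be pictured roughly as follows: masked spectrogram patches are encoded, and each masked position is asked both to identify its own patch among the other masked patches (discriminative term) and to reconstruct it directly (generative term). The sketch below is in the spirit of the SSAST-style framework the paper adopts; the masking scheme, head names, and loss weight are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def joint_pretraining_loss(encoder, clf_head, gen_head, patches, mask, lam=10.0):
    """Sketch of a masked-spectrogram pretraining step.

    patches: (B, T, P) flattened spectrogram patches
    mask:    (B, T) boolean, True where a patch is masked before encoding
    lam:     weight balancing the two terms (value is an assumption)
    """
    masked_in = patches.masked_fill(mask.unsqueeze(-1), 0.0)  # simple zero-masking
    hidden = encoder(masked_in)                               # (B, T, D)

    h = hidden[mask]        # (M, D) representations at masked positions
    target = patches[mask]  # (M, P) ground-truth patches

    # Discriminative: match each masked representation to its own patch
    # against the other masked patches in the batch (InfoNCE-style).
    logits = clf_head(h) @ target.t()                         # (M, M)
    disc = F.cross_entropy(logits, torch.arange(len(h), device=h.device))

    # Generative: reconstruct the masked patch directly.
    gen = F.mse_loss(gen_head(h), target)

    return disc + lam * gen
```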
Finally, the paper presents results from pretraining and downstream performance comparisons, showing that SSAMBA consistently outperforms SSAST, especially in larger model configurations. The efficiency gains and superior performance highlight the effectiveness of SSAMBA's architectural innovations, making it a compelling choice for a wide range of audio processing applications.