5 Jun 2024 | Mehmet Hamza Erol*, Arda Senocak*, Jiu Feng, Joon Son Chung
The paper introduces Audio Mamba (AuM), a novel architecture for audio classification that leverages bidirectional state space models (SSMs) to process audio spectrograms without relying on self-attention. This addresses the computational inefficiency of Audio Spectrogram Transformers (ASTs), whose self-attention scales quadratically with sequence length. AuM is efficient in both time and memory, achieving linear complexity with respect to sequence length and feature dimension. The model is evaluated on six diverse datasets, including AudioSet, VGGSound, VoxCeleb, Speech Commands V2, and Epic-Sounds, achieving performance comparable to or better than well-established AST models. Key contributions include the elimination of self-attention, the introduction of bidirectional SSMs, and the strategic placement of a classification token. The paper also explores the impact of different design choices, such as the direction of the SSM modules and the position of the classification token, and compares models initialized with pre-trained weights from ImageNet and AudioSet. Overall, AuM shows promise as a versatile and efficient alternative to AST, particularly for handling long audio sequences and for future applications in self-supervised learning and multimodal tasks.
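To make the described architecture concrete, here is a minimal structural sketch of an AuM-style model: a spectrogram is patchified, a classification token is inserted in the middle of the token sequence, and each block mixes the sequence in both directions. This is an illustrative assumption written in PyTorch, not the authors' implementation; in particular, a plain GRU stands in for the selective SSM (Mamba) scan, and all names and dimensions (AudioMambaSketch, BidirectionalMixerBlock, patch size 16, 527 classes) are hypothetical.

```python
# Structural sketch only: a GRU replaces the selective SSM kernel used in the paper.
import torch
import torch.nn as nn


class BidirectionalMixerBlock(nn.Module):
    """Mixes the token sequence forward and backward, then merges the two scans."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # Placeholder sequence models; AuM uses bidirectional Mamba SSM scans here.
        self.fwd = nn.GRU(dim, dim, batch_first=True)
        self.bwd = nn.GRU(dim, dim, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, x):                                  # x: (batch, tokens, dim)
        h = self.norm(x)
        out_f, _ = self.fwd(h)                             # forward scan
        out_b, _ = self.bwd(torch.flip(h, dims=[1]))       # backward scan
        out_b = torch.flip(out_b, dims=[1])                # re-align to original order
        return x + self.proj(torch.cat([out_f, out_b], dim=-1))  # residual connection


class AudioMambaSketch(nn.Module):
    def __init__(self, n_mels=128, n_frames=1024, patch=16, dim=192,
                 depth=4, num_classes=527):
        super().__init__()
        self.patch_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        num_patches = (n_mels // patch) * (n_frames // patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        self.blocks = nn.ModuleList(BidirectionalMixerBlock(dim) for _ in range(depth))
        self.head = nn.Linear(dim, num_classes)

    def forward(self, spec):                               # spec: (batch, 1, n_mels, n_frames)
        x = self.patch_embed(spec).flatten(2).transpose(1, 2)  # (batch, num_patches, dim)
        mid = x.shape[1] // 2
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        # Classification token placed in the middle of the sequence, one of the
        # placement choices the paper ablates.
        x = torch.cat([x[:, :mid], cls, x[:, mid:]], dim=1) + self.pos_embed
        for blk in self.blocks:
            x = blk(x)
        return self.head(x[:, mid])                        # read out the middle (cls) token


logits = AudioMambaSketch()(torch.randn(2, 1, 128, 1024))
print(logits.shape)  # torch.Size([2, 527])
```

The sketch keeps the key structural ideas summarized above: no self-attention anywhere, per-block bidirectional sequence processing, and a middle-placed classification token used for the final prediction. Swapping the GRU placeholders for true selective SSM scans is what gives AuM its linear-time behavior on long audio sequences.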