5 Jun 2024 | Mehmet Hamza Erol*, Arda Senocak*, Jiu Feng, Joon Son Chung
Audio Mamba (AuM) is a self-attention-free, state space model (SSM)-based architecture for audio classification. It addresses the quadratic complexity of self-attention in Audio Spectrogram Transformers (ASTs) by using bidirectional SSMs to model long sequences efficiently. Evaluated on six audio datasets, AuM achieves performance comparable to or better than AST. The architecture divides the audio spectrogram into patches, adds a learnable classification token, and applies bidirectional SSMs for sequence modeling; a minimal sketch of this pipeline is given below. Because the design eliminates self-attention, it scales linearly with sequence length, giving AuM lower memory consumption and faster inference than AST. Ablation studies show that placing the classification token in the middle of the sequence is optimal. AuM also remains competitive when initialized with pre-trained weights, and it outperforms AST under in-domain pre-training. The model shows promise as a generic audio backbone for future work in self-supervised learning and multimodal tasks, offering an efficient and effective alternative to transformer-based approaches for audio classification.
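To make the patch-embed, middle class token, and bidirectional-scan steps concrete, here is a minimal, self-contained PyTorch sketch. It is an illustration of the described pipeline, not the authors' implementation: `SequenceMixerStub` is a hypothetical stand-in (a gated causal depthwise convolution) for the actual selective SSM (Mamba) blocks, and names such as `AudioMambaSketch`, the patch size, and the embedding dimension are assumptions chosen for brevity.

```python
import torch
import torch.nn as nn

class SequenceMixerStub(nn.Module):
    """Hypothetical stand-in for a selective SSM block: a gated causal
    depthwise convolution. The real AuM uses selective state space scans."""
    def __init__(self, dim, kernel_size=4):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size,
                              padding=kernel_size - 1, groups=dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):                        # x: (batch, seq, dim)
        h = self.conv(x.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        return h * torch.sigmoid(self.gate(x))   # gated causal mixing

class BidirectionalBlock(nn.Module):
    """Runs one mixer left-to-right and another on the reversed sequence,
    then sums, so every token aggregates context from both directions."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fwd = SequenceMixerStub(dim)
        self.bwd = SequenceMixerStub(dim)

    def forward(self, x):
        h = self.norm(x)
        return x + self.fwd(h) + self.bwd(h.flip(1)).flip(1)

class AudioMambaSketch(nn.Module):
    """AuM-style pipeline: patchify the spectrogram, insert a learnable
    classification token at the middle of the sequence, apply bidirectional
    blocks, and classify from that middle token."""
    def __init__(self, n_mels=128, n_frames=1024, patch=16, dim=192,
                 depth=4, num_classes=50):
        super().__init__()
        self.patch_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        num_patches = (n_mels // patch) * (n_frames // patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        self.blocks = nn.Sequential(*[BidirectionalBlock(dim)
                                      for _ in range(depth)])
        self.head = nn.Linear(dim, num_classes)
        self.mid = num_patches // 2              # class token position

    def forward(self, spec):                     # spec: (B, 1, n_mels, n_frames)
        x = self.patch_embed(spec).flatten(2).transpose(1, 2)   # (B, N, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([x[:, : self.mid], cls, x[:, self.mid:]], dim=1)
        x = self.blocks(x + self.pos_embed)
        return self.head(x[:, self.mid])         # read out the middle token

logits = AudioMambaSketch()(torch.randn(2, 1, 128, 1024))
print(logits.shape)   # torch.Size([2, 50])
```

The class token is spliced in at the middle index and read out from the same position, mirroring the ablation finding that a middle placement works best for the bidirectional scan.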