18 Jun 2024 | Yujie Chen, Jiangyan Yi, Jun Xue, Chenglong Wang, Xiaohui Zhang, Shunbo Dong, Siding Zeng, Jianhua Tao, Lv Zhao, Cunhang Fan
The paper introduces RawBMamba, an end-to-end bidirectional state space model for audio deepfake detection. The model is designed to capture both short- and long-range discriminative information, which improves detection capability. RawBMamba uses sinc layers and multiple convolutional layers to extract short-range features, followed by a bidirectional Mamba model that overcomes the unidirectional modeling limitation of Mamba and captures long-range feature information. A bidirectional fusion module then integrates the embeddings from both directions, enriching the audio context representation and combining short- and long-range information. Experimental results on the ASVspoof2021 LA dataset show that RawBMamba achieves a 34.1% improvement over Rawformer, and the model also performs competitively on other datasets. The model's effectiveness is further examined through t-SNE visualization, which shows that the Mamba architecture captures more discriminative features than the Transformer architecture. The paper concludes by highlighting RawBMamba's robustness and generalizability, suggesting its potential as a backbone model for audio deepfake detection.
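
The motivation for bidirectional modeling can be illustrated with a toy sketch. This is not the RawBMamba implementation: it replaces the selective Mamba block with a fixed scalar linear recurrence and assumes that fusion is a simple concatenation of the two directions. The function names `scan` and `bidirectional_features` and the parameters `a` and `b` are hypothetical, chosen only for illustration.

```python
from typing import List


def scan(x: List[float], a: float = 0.5, b: float = 1.0) -> List[float]:
    """Causal linear recurrence h_t = a * h_{t-1} + b * x_t.

    A toy stand-in for a state space layer: each output depends only on
    the current and earlier inputs (unidirectional modeling).
    """
    h = 0.0
    out = []
    for value in x:
        h = a * h + b * value
        out.append(h)
    return out


def bidirectional_features(
    x: List[float], a: float = 0.5, b: float = 1.0
) -> List[List[float]]:
    """Run the scan in both directions and fuse per time step.

    Assumption: fusion here is plain concatenation [forward_t, backward_t];
    the fusion module described in the paper may differ.
    """
    forward = scan(x, a, b)
    # Reverse the sequence, scan, then restore the original time order.
    backward = scan(x[::-1], a, b)[::-1]
    return [[f, r] for f, r in zip(forward, backward)]


if __name__ == "__main__":
    signal = [1.0, 0.0, 0.0, 2.0]
    for t, feats in enumerate(bidirectional_features(signal)):
        print(t, feats)
```

In this sketch the forward state at the first time step sees only the first sample, while the backward state at the same step already carries a decayed contribution from the last sample. Fusing the two gives every position access to both past and future context, which is the gap a purely causal scan leaves open.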