Audio Anti-Spoofing Detection: A Survey

22 Apr 2024 | MENGLU LI, YASAMAN AHMADIADLI, and XIAO-PING ZHANG
This survey provides a comprehensive review of audio anti-spoofing detection, covering algorithm architectures, optimization techniques, application generalizability, evaluation metrics, performance comparisons, available datasets, and open-source availability. It discusses recent advances, including partial spoofing detection, cross-dataset evaluation, and defense against adversarial attacks. It also explores emerging research topics and proposes directions for future work. By identifying the current state of the art, it establishes strong baselines for future experiments and gives researchers a clear path for understanding and improving audio anti-spoofing detection mechanisms.

The paper summarizes the datasets and evaluation metrics used in audio anti-spoofing detection. It groups datasets into three categories: fully spoofed, partially spoofed, and fully real. The fully spoofed datasets include ASVspoof2019-LA, ASVspoof2021-LA, ASVspoof2021-DF, and FakeorReal-original. The partially spoofed datasets include Partial Synthetic Detection (Psynd), PartialSpoof, ADD2022-PF, ADD2023-PF, and Half-Truth (HAD). The fully real datasets include VCTK, LibriSpeech, VoxCeleb2, and LJ Speech. The metrics reviewed include Equal Error Rate (EER), F1-score, Accuracy, and Tandem Detection Cost Function (t-DCF). The paper discusses the strengths and limitations of these metrics and emphasizes metrics tailored to the specific challenges of detecting partially spoofed content.

The paper evaluates every component of the detection pipeline for fully spoofed audio, including algorithm architectures and training optimization techniques. It categorizes current feature extraction methods into three groups: hand-crafted traditional spectral features, deep-learning features, and other analysis-oriented approaches.
It discusses several types of deep-learning features, including filter-learning features, supervised embeddings, and pre-trained embeddings. The paper also discusses classifier architectures, including traditional machine learning classifiers, convolutional neural networks (CNN), residual networks (ResNet), and graph neural networks (GNN). It evaluates the strengths and limitations of these classifier structures and highlights the effectiveness of the various architectures in audio anti-spoofing detection.
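
Of the metrics named in this summary, the Equal Error Rate (EER) is the operating point at which the false acceptance rate (spoofed audio accepted as real) equals the false rejection rate (real audio rejected as spoofed). The following is a minimal sketch of that definition and not the survey's evaluation code. It assumes that a higher score means "more likely bona fide"; the function name `compute_eer` and the example scores are hypothetical.

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Approximate the EER by scanning every observed score as a threshold.

    Assumption: higher scores indicate bona fide (real) speech.
    Returns (eer, threshold).
    """
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    thresholds = np.unique(np.concatenate([bonafide, spoof]))  # sorted ascending

    # False rejection rate: bona fide trials scored below the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])
    # False acceptance rate: spoofed trials scored at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])

    # Pick the threshold where the two error rates are closest and
    # report their average as the EER.
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]

# Hypothetical scores for four bona fide and four spoofed utterances.
eer, threshold = compute_eer([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
print(f"EER = {eer:.2f} at threshold {threshold:.2f}")
```

Scanning only the observed scores gives an approximation; with few trials the two error curves may not cross exactly, which is why the sketch averages them at the closest point.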