A lightweight feature extraction technique for deepfake audio detection
25 January 2024 | Nidhi Chakravarty, Mohit Dua
The paper introduces a lightweight feature extraction technique for deepfake audio detection, addressing growing concerns over the authenticity and reliability of audio content. The authors propose an improved method that applies a modified ResNet50 model to audio Mel spectrograms to extract deep features, which are then reduced in dimensionality with Linear Discriminant Analysis (LDA) to lower computational complexity. The reduced features are used to train machine learning (ML) classifiers: Support Vector Machine (SVM), Random Forest (RF), K-Nearest Neighbour (KNN), and Naive Bayes (NB).

The ASVspoof 2019 Logical Access (LA) partition is used for training, while the ASVspoof 2021 DeepFake (DF) partition is used for evaluation; the DECRO dataset is additionally used to test the model under unseen noisy conditions. The method outperforms traditional feature extraction techniques such as Mel Frequency Cepstral Coefficients (MFCC) and Gammatone Cepstral Coefficients (GTCC), achieving an Equal Error Rate (EER) of 0.4% and an accuracy of 99.7%.

The introduction highlights the rapid spread of information through social media platforms and the potential for deepfakes to cause widespread disinformation. It discusses the challenges posed by recent deepfake generation algorithms, such as text-to-speech systems whose output can be indistinguishable from human speech. The paper also reviews common feature extraction techniques and their limitations, emphasizing the need for more robust methods, and discusses classification models from the literature, including Hidden Markov Models (HMM), Gaussian Mixture Models (GMM), and SVM, along with their applications in automatic speaker verification (ASV) tasks.
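To make the described pipeline concrete, below is a minimal Python sketch of the approach. It assumes a stock ImageNet-pretrained Keras ResNet50 stands in for the paper's modified backbone (whose exact modifications are not reproduced here), scikit-learn for LDA and an SVM classifier, and librosa for the Mel spectrograms. The sample rate, Mel-band count, 224x224 input size, and file names are illustrative assumptions rather than values taken from the paper.

```python
# A minimal sketch of the summarized pipeline: Mel spectrogram ->
# ResNet50 embedding -> LDA reduction -> classifier -> EER.
import numpy as np
import librosa
import tensorflow as tf
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.metrics import roc_curve

SR, N_MELS, IMG_SIZE = 16000, 128, (224, 224)  # assumed settings

def mel_image(wav_path):
    """Render a log-Mel spectrogram as a 3-channel 224x224 image."""
    y, _ = librosa.load(wav_path, sr=SR)
    mel = librosa.feature.melspectrogram(y=y, sr=SR, n_mels=N_MELS)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    # Scale to [0, 255], resize, and tile to 3 channels for ResNet50.
    img = 255.0 * (log_mel - log_mel.min()) / (np.ptp(log_mel) + 1e-9)
    img = tf.image.resize(img[..., np.newaxis], IMG_SIZE).numpy()
    return np.repeat(img, 3, axis=-1)

# Global-average-pooled ResNet50 yields one 2048-d embedding per clip.
backbone = ResNet50(weights="imagenet", include_top=False,
                    pooling="avg", input_shape=(*IMG_SIZE, 3))

def extract_features(wav_paths):
    imgs = np.stack([mel_image(p) for p in wav_paths])
    return backbone.predict(preprocess_input(imgs), verbose=0)

def equal_error_rate(y_true, scores):
    """EER: the operating point where the false acceptance rate
    equals the false rejection rate."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fpr - fnr))
    return (fpr[idx] + fnr[idx]) / 2.0

# Hypothetical file lists; in the paper these would come from the
# ASVspoof 2019 LA (train) and ASVspoof 2021 DF (eval) protocols.
train_paths, y_train = ["bonafide_0001.flac", "spoof_0001.flac"], np.array([0, 1])
eval_paths, y_eval = ["eval_0001.flac", "eval_0002.flac"], np.array([0, 1])

X_train = extract_features(train_paths)
X_eval = extract_features(eval_paths)

# With two classes, LDA keeps at most n_classes - 1 components, so each
# 2048-d embedding is projected onto a single discriminant axis.
lda = LinearDiscriminantAnalysis()
X_train_lda = lda.fit_transform(X_train, y_train)
X_eval_lda = lda.transform(X_eval)

# SVM shown here; the paper also trains RF, KNN, and NB on the same features.
clf = SVC().fit(X_train_lda, y_train)
scores = clf.decision_function(X_eval_lda)
print(f"EER: {100 * equal_error_rate(y_eval, scores):.2f}%")
```

Note that because LDA retains at most one component for a binary bonafide-vs-spoof task, each clip is reduced to a single scalar feature before classification, which is what makes the downstream ML models so lightweight relative to end-to-end deep classifiers.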