4 Mar 2024 | Yujie Yang†, Haochen Qin†, Hang Zhou, Chengcheng Wang, Tianyu Guo, Kai Han*, Yunhe Wang*
This paper proposes a robust audio deepfake detection (ADD) system using multi-view feature incorporation. The study investigates the generalizability of ADD systems by evaluating various audio features, including hand-crafted and learning-based features. Experiments show that learning-based features pretrained on large datasets generalize better than hand-crafted features in out-of-domain scenarios. To further improve generalizability, the authors propose two multi-view feature incorporation methods: feature selection and feature fusion. These methods incorporate complementary information from different features to enhance detection accuracy.
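To make the fusion idea concrete, below is a minimal, hypothetical PyTorch sketch (not the authors' released code, which is not yet available): utterance-level embeddings from several feature views are projected to a common size, concatenated, and passed to a shared classifier. All module names, dimensions, and the two-layer classifier are illustrative assumptions.

```python
# Illustrative sketch only -- NOT the paper's implementation.
# Shows the general idea of multi-view feature fusion: project each
# feature view to a shared dimension, concatenate, then classify.
import torch
import torch.nn as nn

class FusionADD(nn.Module):
    def __init__(self, view_dims, hidden_dim=256, num_classes=2):
        super().__init__()
        # One projection per feature view (e.g., LFCC, Wav2Vec2, HuBERT, ...)
        self.projections = nn.ModuleList(
            [nn.Linear(d, hidden_dim) for d in view_dims]
        )
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim * len(view_dims), hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, views):
        # views: list of per-view utterance embeddings, each (batch, view_dims[i])
        projected = [proj(v) for proj, v in zip(self.projections, views)]
        fused = torch.cat(projected, dim=-1)  # concatenate complementary views
        return self.classifier(fused)

# Usage with three hypothetical views of different embedding sizes:
model = FusionADD(view_dims=[60, 1024, 768])
logits = model([torch.randn(4, 60), torch.randn(4, 1024), torch.randn(4, 768)])
```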
The study evaluates 14 audio features, including hand-crafted features such as MFCC, LFCC, and CQT, and learning-based features such as SincNet, LEAF, EnCodec, AudioDec, Wav2Vec2, HuBERT, WavLM, and AudioMAE. Results show that learning-based features, particularly Wav2Vec2 XLS-R and HuBERT, perform best on the In-the-Wild dataset, with EERs of 27.48% and 24.27%, respectively. The feature fusion method further improves performance, reaching an EER of 24.27%. The feature selection method also enhances performance, using a sample-aware mask mechanism to select the most appropriate features for each sample; a sketch of that mechanism follows.
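The sample-aware mask can be pictured as a lightweight gating network that scores each feature view per input and weights the views accordingly. The sketch below is an assumption-laden illustration of that mechanism, not the paper's exact design; the soft softmax gate shown here is one possible choice, and a hard top-k mask would be a natural variant.

```python
# Hedged sketch of sample-aware feature selection -- the paper's exact
# gating design may differ. Names and dimensions are assumptions.
import torch
import torch.nn as nn

class SampleAwareSelection(nn.Module):
    def __init__(self, view_dims, hidden_dim=256, num_classes=2):
        super().__init__()
        self.projections = nn.ModuleList(
            [nn.Linear(d, hidden_dim) for d in view_dims]
        )
        # Gate maps the stacked view embeddings to one score per view.
        self.gate = nn.Linear(hidden_dim * len(view_dims), len(view_dims))
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, views):
        # views: list of per-view embeddings, each (batch, view_dims[i])
        projected = torch.stack(
            [proj(v) for proj, v in zip(self.projections, views)], dim=1
        )  # (batch, num_views, hidden_dim)
        flat = projected.flatten(start_dim=1)
        # Soft per-sample mask over the views (sums to 1 for each sample).
        mask = torch.softmax(self.gate(flat), dim=-1)       # (batch, num_views)
        selected = (mask.unsqueeze(-1) * projected).sum(1)  # (batch, hidden_dim)
        return self.classifier(selected)
```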
The study concludes that learning-based features, especially those pretrained on large datasets, offer better generalization for ADD systems. The proposed multi-view feature incorporation methods significantly improve the generalizability and accuracy of the ADD system compared with single-feature approaches. The code for this work will be released soon.