A ROBUST AUDIO DEEPFAKE DETECTION SYSTEM VIA MULTI-VIEW FEATURE
4 Mar 2024 | Yujie Yang†, Haochen Qin†, Hang Zhou, Chengcheng Wang, Tianyu Guo, Kai Han*, Yunhe Wang*
This paper addresses the challenge of audio deepfake detection (ADD) in the context of advancing generative modeling techniques that produce synthetic human speech nearly indistinguishable from real speech. The authors investigate a broad range of audio features, including handcrafted and learning-based features, to improve the generalizability of ADD systems. Experiments show that learning-based features pre-trained on large datasets outperform handcrafted features in out-of-domain scenarios. To further enhance generalizability, the authors propose two multi-view feature incorporation methods: feature selection and feature fusion. These methods leverage complementary information from different feature views to improve detection of deepfake samples, especially those generated by unknown synthesis systems.

The model trained on the ASV2019 dataset achieves an equal error rate (EER) of 24.27% on the In-the-Wild dataset, demonstrating the effectiveness of the proposed approaches. The paper concludes by highlighting the superior generalizability of deep features and the benefits of multi-view feature incorporation in improving ADD system performance.
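The summary does not give implementation details for the feature fusion method, but the general idea of combining complementary feature views can be illustrated with a minimal sketch. Here, simple per-view normalization followed by concatenation is assumed; the function name `fuse_views` and the two-view setup (one handcrafted, one learned embedding) are illustrative, not taken from the paper.

```python
import numpy as np

def fuse_views(handcrafted: np.ndarray, learned: np.ndarray) -> np.ndarray:
    """Fuse two per-utterance feature views into a single vector.

    Each view is L2-normalized before concatenation so that neither
    view dominates the fused representation purely by scale.
    (Illustrative sketch; the paper's actual fusion may differ.)
    """
    h = handcrafted / (np.linalg.norm(handcrafted) + 1e-8)
    l = learned / (np.linalg.norm(learned) + 1e-8)
    return np.concatenate([h, l])

# Example: a 3-dim handcrafted view and a 4-dim learned view
fused = fuse_views(np.ones(3), np.ones(4))
print(fused.shape)  # (7,)
```

A downstream classifier would then be trained on the fused vectors, letting it draw on whichever view is more informative for a given sample.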
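The equal error rate used to report results above is the operating point where the false acceptance rate (spoofed audio accepted as bona fide) equals the false rejection rate (bona fide audio rejected). A minimal sketch of computing it from detector scores, assuming higher scores indicate bona fide speech:

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Return the EER given detector scores and labels
    (1 = bona fide, 0 = spoof). Scans candidate thresholds and
    reports the mean of FAR and FRR where their gap is smallest."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        far = np.mean(scores[labels == 0] >= t)  # spoof accepted
        frr = np.mean(scores[labels == 1] < t)   # bona fide rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Perfectly separated scores give an EER of 0
print(equal_error_rate([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # 0.0
```

Production toolkits typically interpolate the ROC curve rather than scanning raw thresholds, but the fixed point being sought is the same.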