5 Jun 2024 | Trevine Oorloff, Surya Koppisetti, Ben Colman, Yaser Yacoob, Nicolò Bonettini, Ali Shahriyari, Divyaraj Solanki, Gaurav Bharaj
The paper "AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection" introduces a novel two-stage cross-modal learning method called Audio-Visual Feature Fusion (AVFF) to improve deepfake detection. The first stage involves self-supervised representation learning on real videos to capture the intrinsic audio-visual correspondences using contrastive learning and autoencoding objectives, along with a novel complementary masking and feature fusion strategy. The second stage tunes the learned representations for deepfake classification using supervised learning on both real and fake videos. Extensive experiments on the FakeAVCeleb dataset show that AVFF achieves 98.6% accuracy and 99.1% AUC, outperforming existing audio-visual state-of-the-art methods by 14.9% and 9.9%, respectively. The method leverages the inherent audio-visual correspondence in real videos, which is challenging to replicate in deepfake videos, to enhance detection performance. The paper also includes a detailed analysis of the learned representations and an ablation study to validate the effectiveness of each component of the AVFF framework.The paper "AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection" introduces a novel two-stage cross-modal learning method called Audio-Visual Feature Fusion (AVFF) to improve deepfake detection. The first stage involves self-supervised representation learning on real videos to capture the intrinsic audio-visual correspondences using contrastive learning and autoencoding objectives, along with a novel complementary masking and feature fusion strategy. The second stage tunes the learned representations for deepfake classification using supervised learning on both real and fake videos. Extensive experiments on the FakeAVCeleb dataset show that AVFF achieves 98.6% accuracy and 99.1% AUC, outperforming existing audio-visual state-of-the-art methods by 14.9% and 9.9%, respectively. The method leverages the inherent audio-visual correspondence in real videos, which is challenging to replicate in deepfake videos, to enhance detection performance. The paper also includes a detailed analysis of the learned representations and an ablation study to validate the effectiveness of each component of the AVFF framework.