5 Jun 2024 | Trevine Oorloff, Surya Koppiseti, Nicolò Bonettini, Divyaraj Solanki, Ben Colman, Yaser Yacoob, Ali Shahriyari, Gaurav Bharaj
AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection
AVFF is a two-stage cross-modal learning method that explicitly captures the correspondence between audio and visual modalities for improved deepfake detection. The first stage pursues representation learning via self-supervision on real videos to capture the intrinsic audio-visual correspondences. To extract rich cross-modal representations, we use contrastive learning and autoencoding objectives, and introduce a novel audio-visual complementary masking and feature fusion strategy. The learned representations are tuned in the second stage, where deepfake classification is pursued via supervised learning on both real and fake videos. Extensive experiments and analysis suggest that our novel representation learning paradigm is highly discriminative. We report 98.6% accuracy and 99.1% AUC on the FakeAVCeleb dataset, outperforming the current audio-visual state of the art by 14.9% and 9.9%, respectively.
The proposed method employs a novel complementary masking and cross-modal feature fusion strategy to explicitly capture the audio-visual correspondences. Previous literature on audio-visual video deepfake detection uses supervised contrastive learning to capture the audio-visual correspondence. Such methods pull the audio and visual embeddings closer together when the content in both modalities is real, and push them apart when either or both modalities are generated. Similarly, others pursue a single-stage supervised learning approach, where models are trained directly on labeled deepfake datasets for deepfake classification. While such methods yield promising results, we conjecture that they may not fully exploit the audio-visual correspondence. Moreover, training solely on a deepfake dataset narrows the model's focus to features that separate the classes within the training corpus, potentially overlooking subtle audio-visual correspondences that could help detect unseen deepfake samples.
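To make the prior supervised-contrastive setup concrete, below is a minimal PyTorch-style sketch of such an alignment objective. The function name, the cosine-similarity formulation, and the margin value are illustrative assumptions, not any specific published loss.

```python
import torch
import torch.nn.functional as F


def av_alignment_loss(audio_emb, visual_emb, is_real, margin=0.5):
    """Toy supervised alignment objective (illustrative, not the authors' loss).

    Pulls audio/visual embeddings of real videos together and pushes them
    apart (beyond `margin`) when either modality is fake.
    audio_emb, visual_emb: (B, D) float tensors; is_real: (B,) bool tensor.
    """
    sim = F.cosine_similarity(audio_emb, visual_emb, dim=-1)  # (B,), in [-1, 1]
    pull = 1.0 - sim                 # small when the real pair is well aligned
    push = F.relu(sim - margin)      # small once similarity falls below margin
    return torch.where(is_real, pull, push).mean()


# Example usage with random embeddings and labels.
loss = av_alignment_loss(torch.randn(8, 256), torch.randn(8, 256), torch.rand(8) > 0.5)
```

Because the pull/push targets are driven entirely by the real/fake labels, such an objective is only as broad as the labeled deepfake corpus it is trained on, which is the limitation the passage above conjectures.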
To circumvent these issues, we propose a two-stage training pipeline comprising (i) a self-supervised representation learning stage that explicitly enforces audio-visual correspondence using a novel approach, and (ii) a supervised downstream classification stage. In the representation learning stage, we extract audio-visual representations via self-supervised learning on real-face videos, which are available in abundance. Drawing inspiration from CAV-MAE, we make use of the complementary nature of two learning objectives: contrastive learning and autoencoding. To extract rich representations, we supplement the contrastive learning objective with a novel audio-visual complementary masking and fusion strategy that sits within the autoencoding objective. In the classification stage, we train a classifier that exploits the lack of cohesion between the audio-visual features of deepfake videos to separate them from real videos.
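To make the complementary masking idea concrete, here is a minimal PyTorch-style sketch of how the two masks could be drawn. The function name, the equal token count across modalities, and the 0.5 masking ratio are illustrative assumptions for exposition, not the authors' released implementation.

```python
import torch


def complementary_masks(num_tokens, mask_ratio=0.5, device="cpu"):
    """Draw a random token mask for the visual stream and use its complement
    for the audio stream, so every region masked in one modality is visible
    in the other. Both streams are assumed to be tokenized to the same length
    here purely for illustration; True marks a masked (to-be-reconstructed) token.
    """
    perm = torch.randperm(num_tokens, device=device)
    n_mask = int(num_tokens * mask_ratio)
    visual_mask = torch.zeros(num_tokens, dtype=torch.bool, device=device)
    visual_mask[perm[:n_mask]] = True
    audio_mask = ~visual_mask        # complementary positions remain visible
    return visual_mask, audio_mask


# With 16 tokens and a 0.5 ratio, exactly the tokens hidden from the visual
# encoder are the ones the audio encoder sees, so a reconstruction decoder
# must lean on cross-modal fusion to fill in the masked content.
v_mask, a_mask = complementary_masks(16)
assert not torch.any(v_mask & a_mask)
```

The intent of this design is that reconstruction cannot succeed from a single modality alone, which pressures the encoders to learn the audio-visual correspondences that the downstream classifier later exploits.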
We evaluate our method against existing state-of-the-art approaches on multiple benchmarks. Our results reveal substantial improvements over the existing audio-visual state of the art, improving AUC by 9.9% and accuracy by 14.9% on the FakeAVCeleb dataset.