Common Sense Reasoning for Deepfake Detection


18 Jul 2024 | Yue Zhang, Ben Colman, Xiao Guo, Ali Shahriyari, Gaurav Bharaj
The paper introduces a novel task, Deepfake Detection VQA (DD-VQA), that enhances deepfake detection by incorporating human common-sense reasoning. Traditional deepfake detection methods, which rely on image-based features extracted by neural networks, often struggle to detect unnatural facial attributes such as blurry hairlines, double eyebrows, rigid eye pupils, or unnatural skin shading. These attributes are easily perceived by humans but are difficult for purely image-based feature extractors to capture.

To address this, the authors frame deepfake detection as a VQA task: given a question about the authenticity of an image, a model must produce both a real/fake decision and a textual explanation grounded in common-sense knowledge. The authors introduce a new annotated dataset, DD-VQA, consisting of image-question-answer triplets, and propose a Vision-and-Language Transformer-based framework trained on this task, incorporating text- and image-aware feature alignment to strengthen multi-modal representation learning.

Extensive empirical results demonstrate improved detection performance, generalization ability, and interpretability compared to existing deepfake detection models. The learned multi-modal representations can also be integrated into downstream deepfake detection models, enhancing their performance. Qualitative examples and a robustness evaluation further validate the effectiveness of the approach.
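A common way to implement text- and image-aware feature alignment in vision-language models is a symmetric image-text contrastive (InfoNCE) objective, which pulls matched image/text embedding pairs together and pushes mismatched pairs apart. The sketch below is illustrative only: the paper's exact alignment loss, embedding dimensions, `temperature` value, and the toy vectors here are assumptions, not details taken from the paper.

```python
import math

def normalize(v):
    """L2-normalize a vector (lists of floats stand in for tensors)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def info_nce_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-text contrastive loss over a batch.

    image_embs[i] and text_embs[i] are assumed to be a matched pair;
    all other combinations in the batch serve as negatives.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    n = len(image_embs)
    # Pairwise cosine similarities, scaled by temperature.
    sims = [[dot(img, txt) / temperature for txt in text_embs]
            for img in image_embs]

    def cross_entropy(row, target):
        m = max(row)  # subtract the max to stabilize the softmax
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        return log_z - row[target]

    # Image-to-text direction: row i should select text i.
    i2t = sum(cross_entropy(sims[i], i) for i in range(n)) / n
    # Text-to-image direction: column j should select image j.
    t2i = sum(cross_entropy([sims[i][j] for i in range(n)], j)
              for j in range(n)) / n
    return (i2t + t2i) / 2

# Toy batch: matched pairs are nearly parallel, mismatched are not.
images = [normalize([1.0, 0.1]), normalize([0.1, 1.0])]
texts = [normalize([0.9, 0.2]), normalize([0.2, 0.9])]
loss = info_nce_loss(images, texts)
```

When the i-th image and i-th text embeddings already agree, the loss is near zero; shuffling the text side so pairs no longer match drives it up, which is the signal that aligns the two modalities during training.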