CLIPping the Deception: Adapting Vision-Language Models for Universal Deepfake Detection


June 2024 | Sohail Ahmed Khan, Duc-Tien Dang-Nguyen
This paper explores the effectiveness of pre-trained vision-language models (VLMs) for universal deepfake detection, focusing on adapting CLIP (Contrastive Language-Image Pre-training) with recent adaptation methods. The authors train on a single source dataset (ProGAN-generated and real images) and highlight the importance of retaining both the visual and textual components of CLIP. They compare four transfer learning strategies: fine-tuning, linear probing, prompt tuning, and an adapter network. The best-performing strategy, prompt tuning, outperforms previous methods by 5.01% mAP and 6.61% accuracy while using less than one-third of the training data (200k images). The study evaluates the models on 21 different test sets covering GAN-based, diffusion-based, and commercial generation tools, demonstrating robust performance under challenging conditions such as few-shot learning, post-processing operations, and limited training data. The results show that the proposed adaptations consistently outperform previous baselines and state-of-the-art techniques, achieving superior generalization and robustness.
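The key idea of keeping both of CLIP's encoders can be illustrated with a minimal real/fake scoring sketch: the image embedding is compared against text embeddings of class prompts, rather than discarding the text tower and attaching a classifier head. This is not the authors' implementation; the checkpoint name and prompt wording below are illustrative assumptions, and methods such as prompt tuning would learn the prompt context tokens instead of hand-writing them.

```python
# Minimal sketch (not the paper's code) of CLIP-based real/fake scoring that
# retains both the visual and textual encoders, via Hugging Face transformers.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint choice; the paper uses a CLIP ViT backbone.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
model.eval()

# Hypothetical hand-written class prompts; prompt tuning would instead
# optimize learnable context vectors prepended to the class names.
prompts = ["a real photo", "a synthetically generated (fake) photo"]

@torch.no_grad()
def real_fake_probs(image: Image.Image) -> torch.Tensor:
    """Return [p_real, p_fake] by comparing the image embedding to both prompts."""
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    outputs = model(**inputs)
    # logits_per_image holds temperature-scaled image-text similarities.
    return outputs.logits_per_image.softmax(dim=-1).squeeze(0)

# Usage: probs = real_fake_probs(Image.open("sample.jpg")); fake_score = probs[1].item()
```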