June 2024 | Sohail Ahmed Khan, Duc-Tien Dang-Nguyen
This paper explores the effectiveness of pre-trained vision-language models (VLMs) for universal deepfake detection. The authors investigate four transfer learning strategies: linear probing, fine-tuning, prompt tuning, and adapter network training, focusing on CLIP as the pre-trained backbone. Training uses the ProGAN dataset, which contains 720k real/fake images, although the models are trained on only 200k of them. Prompt tuning, which adapts both the visual and text components of CLIP, outperforms the other strategies by 5.01% in mAP and 6.61% in accuracy.

The models are also evaluated on images from a wide range of deepfake generators, including GANs, diffusion models, and commercial tools. They generalize well across these datasets and remain robust to post-processing operations such as JPEG compression and Gaussian blurring. Few-shot experiments further show that strong performance can be achieved with limited training data. Overall, the results establish prompt tuning as the most effective strategy, with significant improvements over previous state-of-the-art methods, and highlight the importance of leveraging both the visual and text components of CLIP. The authors make their code and pre-trained models available for further research.
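To make the comparison concrete, the sketch below illustrates the simplest of the four strategies, linear probing: CLIP's image encoder is kept frozen and only a small classification head is trained on its features. This is a minimal sketch assuming the OpenAI `clip` package and a standard PyTorch training loop; the model variant, head design, hyperparameters, and data handling are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of CLIP linear probing for real-vs-fake classification.
# Assumes the OpenAI `clip` package and a dataloader yielding (image, label)
# batches preprocessed with CLIP's own transform; all choices here are
# illustrative, not the paper's exact setup.
import torch
import torch.nn as nn
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)
model.eval()  # the CLIP backbone stays frozen; only the linear head is trained

# Binary head on top of frozen CLIP image features
# (ViT-L/14 produces 768-dimensional embeddings).
head = nn.Linear(768, 2).to(device)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One optimization step: frozen CLIP features -> trainable linear classifier."""
    with torch.no_grad():
        feats = model.encode_image(images.to(device)).float()
    logits = head(feats)
    loss = criterion(logits, labels.to(device))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Prompt tuning, the strategy that performs best in the paper, differs from this baseline in that no separate head is trained on image features alone; instead, learnable prompt embeddings are optimized so that, as the summary notes, both the visual and text components of CLIP are adapted to the detection task.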