Leveraging Representations from Intermediate Encoder-Blocks for Synthetic Image Detection
2024 | Christos Koutlis, Symeon Papadopoulos
The paper introduces RINE, a synthetic image detection method that leverages representations from intermediate encoder blocks of CLIP. A lightweight network maps these representations into a forgery-aware vector space, and a trainable module estimates the importance of each Transformer block for the final prediction. Evaluated on 20 test datasets, the method improves on the state of the art by +10.6% on average, and the best-performing models require only a single epoch of training (approximately 8 minutes).

The method is robust to common image transformations and remains effective even with limited training data. The results show that intermediate representations are more informative for synthetic image detection than features from the final layer. The paper also analyzes the impact of training duration and training-set size, and the contribution of each intermediate stage to the model. The proposed method detects synthetic images produced by a variety of generative models, including diffusion models, and outperforms existing approaches in both accuracy and average precision. The paper concludes that the approach is effective for synthetic image detection while requiring minimal training time and data.
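The core idea — pooling per-block features with trainable importance weights before a small classification head — can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function `rine_style_score`, the use of CLS-token features, and all shapes and parameter names are assumptions for demonstration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def rine_style_score(block_features, block_logits, W, b):
    """Fuse per-block features with learned importance weights, then
    apply a tiny linear head to produce a scalar forgery score.

    block_features: (num_blocks, dim) array, one feature vector per
                    intermediate encoder block (hypothetical CLS tokens)
    block_logits:   (num_blocks,) trainable importance scores
    W, b:           parameters of the linear head
    """
    weights = softmax(block_logits)       # importance of each block
    fused = weights @ block_features      # weighted sum over blocks
    return float(W @ fused + b)           # scalar score (e.g. >0 = synthetic)

# Toy usage with random stand-ins for CLIP features
rng = np.random.default_rng(0)
num_blocks, dim = 24, 1024                # e.g. a ViT-L/14 encoder has 24 blocks
feats = rng.standard_normal((num_blocks, dim))
logits = rng.standard_normal(num_blocks)  # trained jointly with the head
W = rng.standard_normal(dim) / dim
b = 0.0
score = rine_style_score(feats, logits, W, b)
```

In training, `block_logits` and the head would be optimized jointly on real and synthetic images, so the model learns which intermediate blocks carry the most forgery-relevant signal; the paper's actual importance module and projection network are more elaborate than this linear sketch.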