This paper addresses the challenge of Synthetic Image Detection (SID) by leveraging representations from intermediate encoder blocks of CLIP's image encoder. The authors propose RINE (Representations from INtermediate Encoder-blocks), which extracts the low-level visual information captured by these intermediate blocks and projects it into a learnable forgery-aware vector space. A Trainable Importance Estimator (TIE) module weights the contribution of each Transformer block to the final prediction. Evaluated on 20 test datasets spanning images from a wide range of generative models, the method achieves an average absolute performance improvement of +10.6% over state-of-the-art methods. Notably, the best-performing models require only a single epoch of training, which takes approximately 8 minutes. The code for the RINE model is available at https://github.com/mever-team/rine.
- **Synthetic Image Detection (SID)**: Detecting synthetic images produced by generative models such as GANs and diffusion models.
- **Feature Extraction**: Utilizing features from intermediate layers of foundation models like CLIP to capture low-level visual information.
- **Model Architecture**: RINE feeds images through CLIP's frozen image encoder, collects the representations produced by each intermediate Transformer block, and projects them into a learnable forgery-aware vector space (a minimal code sketch follows this list).
- **Trainable Importance Estimator (TIE)**: Learns how strongly each intermediate Transformer block should influence the final prediction.
- **Performance**: Achieves an average +10.6% absolute performance improvement over state-of-the-art methods on 20 test datasets.
- **Training Efficiency**: Requires only a single epoch of training, which takes approximately 8 minutes.
- **Robustness**: Remains highly robust under common image transformations such as cropping, JPEG compression, blurring, and noise addition (an illustrative perturbation set is sketched after the code example below).
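The following PyTorch sketch illustrates the mechanism described above. It is a minimal approximation, not the authors' implementation (which lives at https://github.com/mever-team/rine): the Hugging Face `CLIPVisionModel` stands in for the CLIP backbone, a single shared projection replaces the paper's projection network, the TIE is reduced to one learnable scalar weight per block, and the class name `RINESketch` and all layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel


class RINESketch(nn.Module):
    """Simplified RINE-style detector over intermediate CLIP blocks."""

    def __init__(self, clip_name: str = "openai/clip-vit-base-patch32",
                 proj_dim: int = 128):
        super().__init__()
        self.clip = CLIPVisionModel.from_pretrained(clip_name)
        self.clip.requires_grad_(False)  # backbone stays frozen
        n_blocks = self.clip.config.num_hidden_layers  # 12 for ViT-B/32
        width = self.clip.config.hidden_size           # 768 for ViT-B/32
        # Shared projection mapping each block's CLS token into a learnable
        # forgery-aware space.
        self.proj = nn.Sequential(
            nn.Linear(width, proj_dim), nn.GELU(), nn.Linear(proj_dim, proj_dim)
        )
        # TIE stand-in: one learnable score per block, softmax-normalized
        # into importance weights.
        self.block_scores = nn.Parameter(torch.zeros(n_blocks))
        self.head = nn.Linear(proj_dim, 1)  # real-vs-synthetic logit

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            out = self.clip(pixel_values, output_hidden_states=True)
        # hidden_states[0] is the patch embedding; [1:] are the outputs of
        # the Transformer blocks. Keep each block's CLS token.
        cls_tokens = torch.stack(
            [h[:, 0] for h in out.hidden_states[1:]], dim=1
        )                                          # (batch, n_blocks, width)
        z = self.proj(cls_tokens)                  # (batch, n_blocks, proj_dim)
        w = self.block_scores.softmax(dim=0)       # block importance weights
        fused = (w[None, :, None] * z).sum(dim=1)  # TIE-weighted fusion
        return self.head(fused).squeeze(-1)


model = RINESketch()
logits = model(torch.randn(2, 3, 224, 224))  # dummy, unnormalized batch
print(logits.shape)  # torch.Size([2])
```

Only the projection, the block scores, and the classification head are trainable here, with CLIP frozen throughout; training so few parameters is consistent with the single-epoch, roughly eight-minute training budget reported above.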
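To make the robustness evaluation concrete, below is an illustrative set of the listed perturbations written with PIL and NumPy. The helper names and all parameter values (JPEG quality, blur radius, noise sigma, crop fraction) are assumptions for demonstration, not the paper's settings.

```python
import io

import numpy as np
from PIL import Image, ImageFilter


def jpeg_compress(img: Image.Image, quality: int = 50) -> Image.Image:
    """Round-trip the image through in-memory JPEG encoding."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")


def gaussian_blur(img: Image.Image, radius: float = 2.0) -> Image.Image:
    return img.filter(ImageFilter.GaussianBlur(radius))


def add_noise(img: Image.Image, sigma: float = 8.0) -> Image.Image:
    """Add zero-mean Gaussian noise in pixel space."""
    arr = np.asarray(img).astype(np.float32)
    arr += np.random.normal(0.0, sigma, arr.shape)
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))


def center_crop(img: Image.Image, frac: float = 0.8) -> Image.Image:
    """Keep the central `frac` of each side, discarding the borders."""
    w, h = img.size
    cw, ch = int(w * frac), int(h * frac)
    left, top = (w - cw) // 2, (h - ch) // 2
    return img.crop((left, top, left + cw, top + ch))


# Smoke test on a random RGB image.
img = Image.fromarray(np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8))
for perturb in (jpeg_compress, gaussian_blur, add_noise, center_crop):
    print(perturb.__name__, perturb(img).size)
```

A detector that maintains its accuracy when test images pass through perturbations like these before inference can reasonably be called robust to them.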