8 Jun 2024 | Hao Fang, Jiawei Kong, Wenbo Yu, Bin Chen, Jiawei Li, Shu-Tao Xia, Ke Xu
This paper presents a method for generating universal adversarial perturbations (UAPs) against Vision-Language Pre-training (VLP) models. The authors propose a Contrastive-training Perturbation Generator with Cross-modal conditions (C-PGC) that produces a single perturbation capable of deceiving VLP models across a range of tasks. The perturbation is designed to disrupt the alignment between the image and text modalities, making it difficult for VLP models to produce accurate predictions. C-PGC is trained with a combination of multimodal contrastive learning and a unimodal distance loss, which strengthens both the attack performance and the transferability of the generated UAPs. The method is evaluated on a variety of VLP models and vision-and-language (V+L) tasks, including image-text retrieval, image captioning, visual grounding, and visual entailment. The results show high attack success rates and strong black-box transferability. The attack is also tested on large VLP models such as LLaVA and Qwen-VL, demonstrating its effectiveness against these models as well. The authors conclude that their method is a significant advance in the field of adversarial attacks against VLP models.
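To make the training objective concrete, below is a minimal sketch of a combined attack loss of the kind described above: a multimodal contrastive term that is maximized to break image-text alignment, plus a unimodal term that pushes perturbed image features away from their clean counterparts. This is not the authors' code; it assumes a CLIP-style dual encoder with a standard InfoNCE formulation, and the function name, loss weights, and sign conventions are illustrative placeholders that may differ from the paper's exact formulation.

```python
# Hypothetical sketch of a C-PGC-style training objective, assuming a
# CLIP-style dual encoder; names and weights are placeholders, not the paper's.
import torch
import torch.nn.functional as F

def combined_attack_loss(img_emb_adv, txt_emb, img_emb_clean,
                         temperature=0.07, alpha=1.0):
    """Loss to minimize when training the perturbation generator:
    - maximize the image-text InfoNCE alignment loss (multimodal term),
    - minimize the similarity between adversarial and clean image
      features (unimodal distance term)."""
    img_emb_adv = F.normalize(img_emb_adv, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    img_emb_clean = F.normalize(img_emb_clean, dim=-1)

    # Multimodal contrastive term: standard InfoNCE over matched pairs.
    # Minimizing its NEGATIVE maximizes misalignment between modalities.
    logits = img_emb_adv @ txt_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    align_loss = F.cross_entropy(logits, labels)

    # Unimodal distance term: lower cosine similarity means the perturbed
    # image features drift further from the clean ones.
    unimodal_sim = F.cosine_similarity(img_emb_adv, img_emb_clean, dim=-1).mean()

    return -align_loss + alpha * unimodal_sim
```

In a training loop, the generator would produce a universal perturbation, the surrogate VLP encoders would embed the perturbed images and paired captions, and the generator parameters would be updated by backpropagating through this loss while the encoders stay frozen.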