8 Jun 2024 | Hao Fang¹†, Jiawei Kong²†, Wenbo Yu², Bin Chen²#, Jiawei Li³, Shu-Tao Xia¹, Ke Xu⁴
The paper "One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models" addresses the vulnerability of Vision-Language Pre-training (VLP) models to adversarial attacks. While previous studies have shown that VLP models are susceptible to instance-specific adversarial samples, this paper introduces a new class of universal adversarial perturbations (UAPs) that can be applied to all input samples. The authors propose the Contrastive-training Perturbation Generator with Cross-modal conditions (C-PGC), a generative framework that incorporates cross-modal information to enhance the effectiveness of UAPs. C-PGC uses a conditional generator trained with a multimodal contrastive loss to produce UAPs that disrupt the multimodal feature alignment in VLP models. Extensive experiments on various VLP models and Vision-and-Language (V+L) tasks demonstrate the effectiveness and transferability of C-PGC, achieving high attack success rates and robustness against different defense strategies. The paper also highlights the practical significance of C-PGC in evaluating the adversarial robustness of VLP models.The paper "One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models" addresses the vulnerability of Vision-Language Pre-training (VLP) models to adversarial attacks. While previous studies have shown that VLP models are susceptible to instance-specific adversarial samples, this paper introduces a new class of universal adversarial perturbations (UAPs) that can be applied to all input samples. The authors propose the Contrastive-training Perturbation Generator with Cross-modal conditions (C-PGC), a generative framework that incorporates cross-modal information to enhance the effectiveness of UAPs. C-PGC uses a conditional generator trained with a multimodal contrastive loss to produce UAPs that disrupt the multimodal feature alignment in VLP models. Extensive experiments on various VLP models and Vision-and-Language (V+L) tasks demonstrate the effectiveness and transferability of C-PGC, achieving high attack success rates and robustness against different defense strategies. The paper also highlights the practical significance of C-PGC in evaluating the adversarial robustness of VLP models.