22 Mar 2024 | Kuofeng Gao, Yang Bai, Jindong Gu, Shu-Tao Xia, Philip Torr, Zhifeng Li, Wei Liu
This paper explores the vulnerability of large vision-language models (VLMs) to energy-latency attacks, in which an attacker induces high energy consumption and latency during inference. The authors propose *verbose images*, imperceptible perturbations crafted to force VLMs to generate longer sequences, thereby increasing energy consumption and latency. The key contributions include:
1. **Verbose Images**: A technique for crafting perturbations that delay the end-of-sequence (EOS) token, increase output uncertainty, and improve token diversity, leading to longer generated sequences.
2. **Loss Objectives**: Three loss functions (a delayed EOS loss, an uncertainty loss, and a token diversity loss) are designed to optimize the perturbations; see the sketch after this list.
3. **Temporal Weight Adjustment**: An algorithm that dynamically balances the optimization of the three loss objectives during the attack iterations (also covered in the sketch below).
4. **Experiments**: Extensive experiments on MS-COCO and ImageNet datasets demonstrate that verbose images can increase the length of generated sequences by 7.87× and 8.56×, respectively, compared to original images, significantly increasing energy consumption and latency.
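To make the loss objectives concrete, here is a minimal PyTorch-style sketch of how the three terms could look. The exact formulations are given in the paper; the function names, and the assumption that `logits` is a `(seq_len, vocab_size)` tensor and `hidden_states` a `(seq_len, dim)` matrix, are illustrative rather than the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def delayed_eos_loss(logits, eos_token_id):
    # Mean EOS probability over all decoding positions; minimizing it
    # discourages the model from ending the sequence early.
    probs = F.softmax(logits, dim=-1)            # (seq_len, vocab_size)
    return probs[:, eos_token_id].mean()

def uncertainty_loss(logits):
    # Negative mean entropy of the next-token distributions; minimizing it
    # pushes the model toward flatter, more uncertain predictions.
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)
    return -entropy.mean()

def token_diversity_loss(hidden_states):
    # Negative nuclear norm of the hidden-state matrix, a surrogate for its
    # rank; minimizing it encourages more diverse token representations.
    return -torch.linalg.matrix_norm(hidden_states, ord="nuc")
```

These objectives then drive a projected-gradient-style optimization of the image perturbation under an imperceptibility budget, with the temporal weight adjustment re-balancing the three losses over iterations. The loop below is a hedged sketch under assumptions: `model_forward` (returning per-position logits and hidden states), the softmax-based weight-update heuristic, and the hyperparameters `epsilon`, `alpha`, `steps`, and `momentum` are placeholders, not the paper's exact algorithm.

```python
def craft_verbose_image(model_forward, image, eos_token_id,
                        epsilon=8 / 255, alpha=1 / 255, steps=1000, momentum=0.9):
    """Sketch of the attack loop: PGD-style updates on an L-infinity-bounded
    perturbation, with loss weights adjusted over time."""
    delta = torch.zeros_like(image, requires_grad=True)
    weights = torch.ones(3, device=image.device) / 3
    prev = None

    for _ in range(steps):
        logits, hidden_states = model_forward(image + delta)
        losses = torch.stack([
            delayed_eos_loss(logits, eos_token_id),
            uncertainty_loss(logits),
            token_diversity_loss(hidden_states),
        ])
        total = (weights * losses).sum()
        total.backward()

        with torch.no_grad():
            # Descend on the combined loss and project back into the budget.
            delta -= alpha * delta.grad.sign()
            delta.clamp_(-epsilon, epsilon)
            delta.grad.zero_()

            # Temporal weight adjustment (illustrative heuristic): objectives
            # that are currently improving more slowly receive more weight.
            if prev is not None:
                improvement = prev - losses.detach()
                new_w = torch.softmax(-improvement, dim=0)
                weights = momentum * weights + (1 - momentum) * new_w
            prev = losses.detach()

    return (image + delta).detach()
```

In practice the perturbed image would also be clamped back to the valid pixel range before being fed to the model.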
The paper highlights the importance of considering energy-latency costs in the deployment of VLMs and provides a baseline for future research on such attacks.