22 Mar 2024 | Kuofeng Gao, Yang Bai, Jindong Gu, Shu-Tao Xia, Philip Torr, Zhifeng Li, Wei Liu
This paper explores the vulnerability of large vision-language models (VLMs) to energy-latency attacks, in which an attacker induces high energy consumption and latency during inference. The authors propose *verbose images*, imperceptible perturbations crafted to force VLMs to generate longer sequences, thereby increasing energy consumption and latency. The key contributions include:
1. **Verbose Images**: A technique for crafting perturbations that delay the end-of-sequence (EOS) token, increase output uncertainty, and improve token diversity, leading to longer generated sequences.
2. **Loss Objectives**: Three loss functions (a delayed EOS loss, an uncertainty loss, and a token diversity loss) are designed to optimize the perturbations; see the sketch after this list.
3. **Temporal Weight Adjustment**: An algorithm that dynamically balances the optimization of the three loss objectives during the attack iterations (also covered in the sketch below).
4. **Experiments**: Extensive experiments on MS-COCO and ImageNet datasets demonstrate that verbose images can increase the length of generated sequences by 7.87× and 8.56×, respectively, compared to original images, significantly increasing energy consumption and latency.
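To make the loss objectives concrete, here is a minimal PyTorch-style sketch of how the three terms could look. The exact formulations are given in the paper; the function names, and the assumption that `logits` is a `(seq_len, vocab_size)` tensor and `hidden_states` a `(seq_len, dim)` matrix, are illustrative rather than the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def delayed_eos_loss(logits, eos_token_id):
    # Mean EOS probability over all decoding positions; minimizing it
    # discourages the model from ending the sequence early.
    probs = F.softmax(logits, dim=-1)            # (seq_len, vocab_size)
    return probs[:, eos_token_id].mean()

def uncertainty_loss(logits):
    # Negative mean entropy of the next-token distributions; minimizing it
    # pushes the model toward flatter, more uncertain predictions.
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)
    return -entropy.mean()

def token_diversity_loss(hidden_states):
    # Negative nuclear norm of the hidden-state matrix, a surrogate for its
    # rank; minimizing it encourages more diverse token representations.
    return -torch.linalg.matrix_norm(hidden_states, ord="nuc")
```

These objectives then drive a projected-gradient-style optimization of the image perturbation under an imperceptibility budget, with the temporal weight adjustment re-balancing the three losses over iterations. The loop below is a hedged sketch under assumptions: `model_forward` (returning per-position logits and hidden states), the softmax-based weight-update heuristic, and the hyperparameters `epsilon`, `alpha`, `steps`, and `momentum` are placeholders, not the paper's exact algorithm.

```python
def craft_verbose_image(model_forward, image, eos_token_id,
                        epsilon=8 / 255, alpha=1 / 255, steps=1000, momentum=0.9):
    """Sketch of the attack loop: PGD-style updates on an L-infinity-bounded
    perturbation, with loss weights adjusted over time."""
    delta = torch.zeros_like(image, requires_grad=True)
    weights = torch.ones(3, device=image.device) / 3
    prev = None

    for _ in range(steps):
        logits, hidden_states = model_forward(image + delta)
        losses = torch.stack([
            delayed_eos_loss(logits, eos_token_id),
            uncertainty_loss(logits),
            token_diversity_loss(hidden_states),
        ])
        total = (weights * losses).sum()
        total.backward()

        with torch.no_grad():
            # Descend on the combined loss and project back into the budget.
            delta -= alpha * delta.grad.sign()
            delta.clamp_(-epsilon, epsilon)
            delta.grad.zero_()

            # Temporal weight adjustment (illustrative heuristic): objectives
            # that are currently improving more slowly receive more weight.
            if prev is not None:
                improvement = prev - losses.detach()
                new_w = torch.softmax(-improvement, dim=0)
                weights = momentum * weights + (1 - momentum) * new_w
            prev = losses.detach()

    return (image + delta).detach()
```

In practice the perturbed image would also be clamped back to the valid pixel range before being fed to the model.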
The paper highlights the importance of considering energy-latency costs in the deployment of VLMs and provides a baseline for future research on such attacks.