INDUCING HIGH ENERGY-LATENCY OF LARGE VISION-LANGUAGE MODELS WITH VERBOSE IMAGES

2024 | Kuofeng Gao, Yang Bai, Jindong Gu, Shu-Tao Xia, Philip Torr, Zhifeng Li, Wei Liu
This paper presents a method to induce high energy-latency costs in large vision-language models (VLMs) by crafting imperceptible perturbations, referred to as "verbose images." VLMs, such as GPT-4, have achieved impressive performance in multi-modal tasks but require significant computational resources. Malicious attacks that increase energy consumption and latency during inference can exhaust computational resources and reduce VLM availability. The authors propose verbose images to manipulate VLMs into generating longer sequences, thereby increasing energy-latency costs. The method involves three loss objectives: a delayed EOS loss to delay the occurrence of the end-of-sequence token, an uncertainty loss to increase output uncertainty, and a token diversity loss to enhance token diversity. A temporal weight adjustment algorithm is introduced to balance these objectives. Extensive experiments show that verbose images can increase the length of generated sequences by 7.87× and 8.56× compared to original images on the MS-COCO and ImageNet datasets, respectively. These results demonstrate the effectiveness of verbose images in inducing high energy-latency costs. Additionally, verbose images produce dispersed attention on visual inputs and generate complex sequences with hallucinated content. The study highlights the need for methods specifically designed for VLMs to induce high energy-latency costs. The code for the proposed method is available at https://github.com/KuofengGao/Verbose_Images.
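To make the three loss objectives concrete, here is a minimal NumPy sketch of how they could be computed from a model's per-step next-token distributions and token representations. This is an illustrative approximation, not the paper's implementation: in particular, the paper measures token diversity via the rank of the hidden-state matrix, whereas this sketch substitutes mean pairwise cosine similarity as a simpler proxy, and the temporal weight adjustment is replaced by fixed weights.

```python
import numpy as np

def delayed_eos_loss(probs, eos_id):
    """Sum of the EOS token's probability over all decoding steps.

    probs: (T, V) array of per-step next-token distributions.
    Minimizing this pushes EOS probability down at every step,
    delaying sequence termination (the delayed EOS objective).
    """
    return float(probs[:, eos_id].sum())

def uncertainty_loss(probs, eps=1e-12):
    """Negative mean entropy of the per-step distributions.

    Minimizing this raises output entropy, i.e. increases the
    model's uncertainty over the next token at each step.
    """
    entropy = -(probs * np.log(probs + eps)).sum(axis=1)
    return float(-entropy.mean())

def token_diversity_loss(embeddings):
    """Mean pairwise cosine similarity of generated-token embeddings.

    NOTE: simplified stand-in. The paper quantifies diversity via the
    rank of the hidden-state matrix; cosine similarity is used here
    only as a proxy. Minimizing it spreads the tokens apart.
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T
    t = len(embeddings)
    off_diag = sim.sum() - np.trace(sim)  # exclude self-similarity
    return float(off_diag / (t * (t - 1)))

def total_loss(probs, embeddings, eos_id, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three objectives. The paper rebalances these
    weights during optimization (temporal weight adjustment); the
    weights are fixed here for simplicity."""
    w1, w2, w3 = weights
    return (w1 * delayed_eos_loss(probs, eos_id)
            + w2 * uncertainty_loss(probs)
            + w3 * token_diversity_loss(embeddings))
```

In the attack, this combined loss would be minimized with respect to an imperceptible image perturbation (e.g. via projected gradient descent under an L∞ budget), so that the perturbed "verbose image" drives the VLM toward long, high-entropy, diverse output sequences.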