Energy-Latency Manipulation of Multi-modal Large Language Models via Verbose Samples

25 Apr 2024 | Kuofeng Gao*, Jindong Gu*, Yang Bai, Shu-Tao Xia†, Philip Torr, Senior Member, IEEE, Wei Liu, Fellow, IEEE, Zhifeng Li†, Senior Member, IEEE
This paper investigates the vulnerability of multi-modal large language models (MLLMs), both image-based and video-based, to energy-latency manipulation. The authors propose *verbose samples*, comprising *verbose images* and *verbose videos*, which carry imperceptible perturbations crafted to induce high energy consumption and latency during inference. The key observation is that maximizing the length of the generated sequence drives up both energy cost and latency. To this end, two modality non-specific losses are introduced: a delayed EOS loss that postpones the end-of-sequence token and an uncertainty loss that increases output uncertainty. In addition, modality-specific losses are proposed: a token diversity loss for verbose images and a frame feature diversity loss for verbose videos. A temporal weight adjustment algorithm is also developed to balance these losses during optimization. Experiments demonstrate that verbose samples significantly extend the length of generated sequences: verbose images increase it by 7.87× and 8.56× on MS-COCO and ImageNet, and verbose videos by 4.04× and 4.14× on MSVD and TGIF. The paper further discusses the tendency of verbose samples to elicit complex and hallucinated sequences, and provides a unified interpretation framework explaining the mechanisms behind the energy-latency manipulation.
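To make the two modality non-specific losses concrete, below is a minimal PyTorch sketch, assuming the delayed EOS loss penalizes the EOS probability at every decoding step and the uncertainty loss is the negative mean entropy of the per-step token distributions. The `model` callable, the equal weighting of the two losses, and the PGD-style step sizes are illustrative assumptions, not the paper's exact implementation (which balances the losses via its temporal weight adjustment algorithm and adds the modality-specific diversity losses).

```python
import torch
import torch.nn.functional as F

def delayed_eos_loss(logits: torch.Tensor, eos_token_id: int) -> torch.Tensor:
    """Mean EOS probability over decoding steps (logits: [seq_len, vocab]).
    Minimizing this delays the end-of-sequence token."""
    probs = F.softmax(logits, dim=-1)
    return probs[:, eos_token_id].mean()

def uncertainty_loss(logits: torch.Tensor) -> torch.Tensor:
    """Negative mean entropy of per-step token distributions.
    Minimizing this increases output uncertainty."""
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)
    return -entropy.mean()

def perturb_step(model, image, delta, eos_token_id, alpha=1/255, epsilon=8/255):
    """One PGD-style update of an imperceptible perturbation `delta`.
    `model` is a hypothetical wrapper returning per-step logits; `epsilon`
    bounds the L-inf norm so the perturbation stays imperceptible."""
    delta = delta.detach().requires_grad_(True)
    logits = model(image + delta)                   # [seq_len, vocab]
    loss = delayed_eos_loss(logits, eos_token_id) + uncertainty_loss(logits)
    loss.backward()
    with torch.no_grad():
        delta = delta - alpha * delta.grad.sign()   # descend on both losses
        delta = delta.clamp(-epsilon, epsilon)      # keep perturbation bounded
    return delta
```

Iterating `perturb_step` yields a verbose image under these assumptions; a verbose video would apply the same loop per frame and add a frame feature diversity term, as the abstract describes.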