Energy-Latency Manipulation of Multi-modal Large Language Models via Verbose Samples

2024 | Kuofeng Gao, Jindong Gu, Yang Bai, Shu-Tao Xia, Philip Torr, Wei Liu, Zhifeng Li
This paper investigates the vulnerability of multi-modal large language models (MLLMs) to energy-latency manipulation, in which malicious users induce high energy consumption and latency during inference. The authors propose "verbose samples": imperceptible perturbations designed to increase the length of generated sequences and thereby maximize energy-latency cost. They introduce two modality non-specific losses, a delayed EOS loss that postpones the end-of-sequence token and an uncertainty loss that increases output uncertainty. They further propose modality-specific losses that enhance diversity in the generated sequences: token diversity for images and frame feature diversity for videos. A temporal weight adjustment algorithm balances these losses during optimization.

Experiments show that verbose samples significantly lengthen the sequences generated by both image-based and video-based LLMs: verbose images increase sequence length by 7.87× and 8.56× on MS-COCO and ImageNet, and verbose videos by 4.04× and 4.14× on MSVD and TGIF. The study also highlights the importance of modality-specific approaches for video-based LLMs and provides a unified interpretation framework for energy-latency manipulation across image and video modalities. The results demonstrate that verbose samples induce high energy-latency cost while remaining imperceptible.
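To make the two modality non-specific losses concrete, here is a minimal PyTorch sketch, not the authors' implementation: the linear `proj` is a stand-in for a real MLLM's forward pass, the fixed 0.1 weight replaces the paper's temporal weight adjustment, and `eos_id`, `eps`, `alpha`, and the step count are made-up hyperparameters for illustration.

```python
import torch
import torch.nn.functional as F

def delayed_eos_loss(logits: torch.Tensor, eos_token_id: int) -> torch.Tensor:
    # Mean EOS probability across generation steps; minimizing this
    # discourages the model from emitting EOS, delaying termination.
    return F.softmax(logits, dim=-1)[:, eos_token_id].mean()

def uncertainty_loss(logits: torch.Tensor) -> torch.Tensor:
    # Negative mean entropy of the per-step token distributions;
    # minimizing this loss *increases* output uncertainty.
    log_p = F.log_softmax(logits, dim=-1)
    return (log_p.exp() * log_p).sum(dim=-1).mean()

# Toy demo: a random linear map stands in for an MLLM producing
# per-step logits from an input image (assumption, for illustration).
torch.manual_seed(0)
vocab, steps, eos_id, eps, alpha = 100, 16, 2, 8 / 255, 1 / 255
image = torch.rand(3, 32, 32)
proj = torch.randn(steps * vocab, image.numel()) * 0.01

delta = torch.zeros_like(image, requires_grad=True)
for _ in range(50):
    logits = (proj @ (image + delta).flatten()).view(steps, vocab)
    loss = delayed_eos_loss(logits, eos_id) + 0.1 * uncertainty_loss(logits)
    loss.backward()
    with torch.no_grad():
        delta -= alpha * delta.grad.sign()  # PGD-style descent step
        delta.clamp_(-eps, eps)             # L-inf ball keeps the perturbation imperceptible
        delta.grad.zero_()
```

In the paper, the analogous update would be driven by a full image- or video-based LLM forward pass, combined with the modality-specific diversity loss, and the loss weights would be rebalanced across iterations by the temporal weight adjustment algorithm rather than held fixed.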