18 Jun 2024 | Xubing Ye, Yukang Gan, Xiaoke Huang, Yixiao Ge, Ying Shan, Yansong Tang
**VoCo-LLaMA: Towards Vision Compression with Large Language Models**
**Authors:** Xubing Ye, Yukang Gan, Xiaoke Huang, Yixiao Ge, Ying Shan, Yansong Tang
**Institution:** Tsinghua Shenzhen International Graduate School, Tsinghua University; ARC Lab, Tencent PCG; UC Santa Cruz
**Abstract:**
Vision-Language Models (VLMs) have achieved notable success on multi-modal tasks, but they are often bottlenecked by limited context windows and high computational costs when processing high-resolution images and videos. Vision compression alleviates this by reducing the number of vision tokens. Previous methods compress vision tokens with external modules, which leads to a loss of visual information. VoCo-LLaMA is the first approach to compress vision tokens using the LLM itself. By introducing Vision Compression (VoCo) tokens during the visual instruction tuning phase and leveraging attention distillation, it distills the LLM's understanding of the original vision tokens into its processing of the compressed tokens. The method incurs minimal performance loss at a compression ratio of 576×, yielding up to 94.8% fewer FLOPs and a 69.6% acceleration in inference time. VoCo-LLaMA also captures temporal correlations in video data, outperforming previous methods on popular video question-answering benchmarks.
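For intuition on the 576× figure: assuming the LLaVA-1.5-style setup that VoCo-LLaMA builds on (a CLIP ViT-L/14 encoder at 336px, i.e. a 24×24 patch grid), one image contributes 576 vision tokens, and collapsing them into a single VoCo token gives the stated ratio. The sketch below is only this back-of-envelope token arithmetic; the FLOPs and latency figures are the paper's own measurements and are not re-derived here.

```python
# Back-of-envelope token arithmetic, assuming a LLaVA-1.5-style visual
# encoder (CLIP ViT-L/14 at 336px -> a 24x24 patch grid). These numbers
# only illustrate the 576x compression ratio; the FLOPs/latency savings
# quoted above are the paper's measurements, not reproduced here.
patch_grid = 24 * 24          # vision tokens per image
voco_tokens = 1               # compressed representation
compression_ratio = patch_grid / voco_tokens
print(compression_ratio)      # 576.0

# LLM context positions freed per image once the raw vision tokens are
# replaced by the cached VoCo token at inference time.
print(patch_grid - voco_tokens)  # 575
```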
**Introduction:**
VLMs have greatly advanced visual understanding, but the large number of vision tokens occupies a substantial portion of the LLM's context window and drives up computational cost. Prior approaches compress vision tokens with external modules, discarding visual information before the LLM sees it. VoCo-LLaMA instead introduces VoCo tokens that isolate and compress the vision tokens inside the LLM, exploiting the LLM's own ability to understand its compressed representations. This yields an efficient representation of visual information and improves computational efficiency during inference.
**Method:**
VoCo-LLaMA appends special VoCo tokens after the vision tokens and applies a two-stage attention scheme: vision tokens and VoCo tokens attend as usual, while text tokens are blocked from attending to the raw vision tokens and must read visual information through the VoCo tokens alone. The training objective is the standard visual instruction tuning loss; minimizing it under this masked attention implicitly aligns the compressed model's outputs with those of the original, uncompressed model. At inference, the transformer activations (KV cache) of the VoCo tokens can be cached and reused, so the full set of vision tokens never has to be re-processed.
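To make the masking concrete, below is a minimal sketch (not the authors' code) of an additive attention mask for a sequence ordered [vision tokens | VoCo tokens | text tokens]; the token counts and the use of a float −inf mask are illustrative assumptions.

```python
# A minimal sketch of a VoCo-style attention mask, assuming the input
# sequence is ordered [vision tokens | VoCo tokens | text tokens] and the
# backbone uses standard causal attention. Illustrative only.
import torch

def voco_attention_mask(n_vision: int, n_voco: int, n_text: int) -> torch.Tensor:
    """Return an additive attention mask of shape (seq, seq).

    0.0 means "may attend", -inf means "blocked". Starting from a causal
    mask, text tokens are additionally blocked from attending to the raw
    vision tokens, so visual information reaches the text only through
    the VoCo tokens, which still attend to the vision tokens.
    """
    seq = n_vision + n_voco + n_text
    # Standard causal mask: position i may attend to positions <= i.
    mask = torch.full((seq, seq), float("-inf"))
    mask = torch.triu(mask, diagonal=1)
    # Block text queries (rows) from vision keys (columns).
    text_start = n_vision + n_voco
    mask[text_start:, :n_vision] = float("-inf")
    return mask

# Example: 576 vision tokens compressed into a single VoCo token.
mask = voco_attention_mask(n_vision=576, n_voco=1, n_text=32)
print(mask.shape)  # torch.Size([609, 609])
```

Because the text tokens only ever see the VoCo positions, the keys and values of those positions are the only visual state that needs to be cached at inference time, which is where the FLOPs and latency savings come from.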
**Experiments:**
VoCo-LLaMA achieves high compression retention rates on various visual understanding benchmarks, outperforming previous methods. It also demonstrates strong performance in video understanding tasks, achieving competitive results even with fewer vision tokens per frame.
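The "compression retention rate" is not defined in this summary; a plausible reading is the compressed model's benchmark score expressed as a fraction of the uncompressed upper bound, possibly normalized against a no-vision lower bound. The sketch below implements both readings with hypothetical numbers and should not be taken as the paper's exact definition.

```python
# Hedged sketch of a "compression retention rate". The exact normalization
# used in the paper is not restated in this summary; two common readings
# are shown. All accuracy values below are hypothetical.
def retention_simple(compressed: float, upper_bound: float) -> float:
    # Compressed score as a percentage of the uncompressed upper bound.
    return 100.0 * compressed / upper_bound

def retention_minmax(compressed: float, upper_bound: float, lower_bound: float) -> float:
    # Min-max variant against a no-vision (or similar) lower bound.
    return 100.0 * (compressed - lower_bound) / (upper_bound - lower_bound)

print(round(retention_simple(58.0, 62.0), 1))        # 93.5
print(round(retention_minmax(58.0, 62.0, 40.0), 1))  # 81.8
```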
**Conclusion:**
VoCo-LLaMA offers a promising solution for efficient vision compression, enabling VLMs to support more scalable multi-modal applications.