18 Jun 2024 | Xubing Ye, Yukang Gan, Xiaoke Huang, Yixiao Ge, Ying Shan, Yansong Tang
VoCo-LLaMA is a novel approach for vision compression that uses the large language model (LLM) itself. It introduces Vision Compression (VoCo) tokens that let the LLM compress and then understand visual information efficiently. By modifying the attention mechanism so that text tokens cannot attend directly to vision tokens, VoCo-LLaMA forces the LLM to distill the visual information into the compact VoCo tokens. The method achieves compression ratios of up to 576× with minimal performance loss, reducing FLOPs by 94.8% and inference time by 69.6%. Caching the compressed transformer activations of the VoCo tokens further improves computational efficiency at inference time and reduces storage requirements. VoCo-LLaMA also performs strongly on video question-answering benchmarks, outperforming previous methods by exploiting temporal correlations among compressed video tokens. Because the compression is carried out by the LLM itself, no specialized text-vision cross-modal fusion module is required, making the approach more scalable for multi-modal applications and a promising route to unlocking the full potential of vision-language models (VLMs).
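As an illustration of the attention modification described above, the sketch below builds a causal attention mask in which VoCo tokens can see the vision tokens but text tokens cannot, so visual content must flow through the VoCo tokens. This is a minimal sketch under assumed token ordering and naming (`build_voco_attention_mask`, the `[vision | VoCo | text]` layout), not the authors' released implementation.

```python
import torch

def build_voco_attention_mask(n_vision: int, n_voco: int, n_text: int) -> torch.Tensor:
    """Illustrative VoCo-style attention mask (boolean; True = attention allowed).

    Assumed token order: [vision tokens | VoCo tokens | text tokens].
    - Vision tokens attend causally among themselves.
    - VoCo tokens attend to all preceding vision tokens, absorbing visual information.
    - Text tokens attend to VoCo and earlier text tokens, but never directly to
      vision tokens, which forces compression through the VoCo tokens.
    """
    seq_len = n_vision + n_voco + n_text
    # Start from a standard causal (lower-triangular) mask.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Block text-token rows from attending to vision-token columns.
    text_start = n_vision + n_voco
    mask[text_start:, :n_vision] = False
    return mask

# Example: 576 vision tokens compressed into a single VoCo token, followed by 32 text tokens.
mask = build_voco_attention_mask(n_vision=576, n_voco=1, n_text=32)
print(mask.shape)  # torch.Size([609, 609])
```

With a mask like this, the hidden states at the VoCo positions become the only path from the image to the text, which is also why their transformer activations can be cached and reused in place of the full set of vision tokens.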