TokenPacker: Efficient Visual Projector for Multimodal LLM

28 Aug 2024 | Wentong Li*, Yuqian Yuan*, Jian Liu, Dongqi Tang, Song Wang, Jie Qin, Jianke Zhu, Lei Zhang
TokenPacker is a novel visual projector designed for Multimodal Large Language Models (MLLMs) that efficiently compresses visual tokens while preserving detailed information. The visual projector bridges the visual encoder and the LLM, enabling efficient visual reasoning.

TokenPacker takes a coarse-to-fine approach: low-resolution visual features are interpolated into coarse queries and refined with high-resolution, multi-level region-based cues to produce compact visual tokens. This reduces the number of visual tokens by 75% to 89% while maintaining or improving performance across benchmarks. A dynamic image slicing scheme supports input images of any aspect ratio, enabling efficient high-resolution image understanding.

Extensive experiments on diverse benchmarks show that TokenPacker outperforms existing methods in both efficiency and accuracy. It achieves a 75% reduction in visual tokens for LLaVA-1.5 while maintaining competitive performance, and reaches state-of-the-art results on OCR-related high-resolution benchmarks. Ablation studies confirm that the coarse-to-fine design and the dynamic image slicing each strengthen the visual token representation. The method is efficient, delivering high throughput alongside robust results on comprehensive benchmarks and VQA-related tasks, and the findings highlight the value of leveraging high-resolution imagery in multimodal tasks.
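The coarse-to-fine idea can be illustrated with a minimal sketch: pool a high-resolution feature map to obtain coarse queries, then let each query cross-attend only to its own s×s high-resolution region to produce one packed token. This is a simplified stand-in, not the paper's implementation (the actual method interpolates CLIP features and uses multi-level region cues); `pack_tokens`, its argument names, and the single-head attention are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pack_tokens(feat, s=2):
    """Compress an (H, W, C) feature map by a factor of s*s.

    Hypothetical sketch of region-restricted coarse-to-fine packing:
    coarse queries attend only to their local high-resolution region.
    """
    H, W, C = feat.shape
    Hc, Wc = H // s, W // s
    # Group the high-res map into (Hc, Wc) regions of s*s cells each.
    regions = feat[:Hc * s, :Wc * s].reshape(Hc, s, Wc, s, C)
    regions = regions.transpose(0, 2, 1, 3, 4).reshape(Hc, Wc, s * s, C)
    # Coarse queries: average-pool each region (stand-in for the paper's
    # interpolated low-resolution features).
    queries = regions.mean(axis=2)                                  # (Hc, Wc, C)
    # One cross-attention step restricted to each query's own region.
    attn = softmax(np.einsum('hwc,hwkc->hwk', queries, regions) / np.sqrt(C))
    packed = np.einsum('hwk,hwkc->hwc', attn, regions)
    return packed.reshape(Hc * Wc, C)                               # compact tokens
```

With `s=2`, an 8×8 grid of 64 tokens is packed into 16 tokens, matching the 75% reduction figure cited for LLaVA-1.5.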
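The dynamic slicing scheme can likewise be sketched as a grid search: among all column-by-row partitions within a slice budget, pick the grid whose aspect ratio best matches the input image. The function name, the slice cap, and the tie-breaking rule (prefer more slices at equal aspect error) are assumptions for illustration, not the paper's exact algorithm.

```python
def best_grid(img_w, img_h, max_slices=9):
    """Choose a (cols, rows) slicing grid for an image of size img_w x img_h.

    Hypothetical sketch: minimize the gap between the grid's aspect ratio
    and the image's, subject to cols * rows <= max_slices.
    """
    target = img_w / img_h
    best, best_err = (1, 1), float('inf')
    for cols in range(1, max_slices + 1):
        for rows in range(1, max_slices // cols + 1):
            err = abs(cols / rows - target)
            # Prefer the closer aspect ratio; on ties, take more slices
            # so more high-resolution detail is preserved.
            if err < best_err or (err == best_err
                                  and cols * rows > best[0] * best[1]):
                best, best_err = (cols, rows), err
    return best
```

For a 2:1 panorama this yields a 4×2 grid (8 slices), while a square image fills the budget with a 3×3 grid, so arbitrary aspect ratios are covered without distorting the crops.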