TokenPacker: Efficient Visual Projector for Multimodal LLM

28 Aug 2024 | Wentong Li1*, Yuqian Yuan1*, Jian Liu2, Dongqi Tang2, Song Wang1, Jie Qin3, Jianke Zhu1, Lei Zhang4
The paper introduces TokenPacker, an efficient visual projector designed for Multimodal Large Language Models (MLLMs). TokenPacker bridges the visual encoder and the LLM by generating condensed visual tokens while preserving fine-grained visual details. The method employs a coarse-to-fine scheme: it first downsamples the visual features to a low-resolution representation, then uses high-resolution, region-based cues to enrich that representation. This process reduces the number of visual tokens by 75% to 89% while maintaining or improving performance across various benchmarks. The approach is evaluated on multiple datasets, including CC-558K, Mini-Gemini, and LLaVA-1.5, demonstrating superior efficiency and effectiveness in high-resolution image understanding. The source code is available at https://github.com/CircleRadon/TokenPacker.
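To make the coarse-to-fine idea and the quoted 75% to 89% reduction concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. Several details are assumptions for illustration: a 24 x 24 grid of visual tokens (576 in total), average pooling as the downsampling step, and a softmax-weighted attention over each s x s high-resolution region as the "enrichment" step. The function names `downsample`, `enrich`, `pack_tokens`, and `token_reduction` are hypothetical.

```python
import math

def downsample(grid, s):
    """Average-pool an H x W grid of feature vectors by a factor s.
    (Assumed stand-in for whatever downsampling the method actually uses.)"""
    h, w, d = len(grid), len(grid[0]), len(grid[0][0])
    coarse = []
    for i in range(0, h, s):
        row = []
        for j in range(0, w, s):
            region = [grid[i + di][j + dj] for di in range(s) for dj in range(s)]
            row.append([sum(v[k] for v in region) / len(region) for k in range(d)])
        coarse.append(row)
    return coarse

def enrich(query, region):
    """Let one coarse token attend to its high-resolution region.
    Scaled dot-product scores, softmax weights, then a residual add (all assumed)."""
    d = len(query)
    scores = [sum(q * x for q, x in zip(query, v)) / math.sqrt(d) for v in region]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(sc - top) for sc in scores]
    total = sum(weights)
    attended = [sum(wt * v[k] for wt, v in zip(weights, region)) / total
                for k in range(d)]
    return [q + a for q, a in zip(query, attended)]

def pack_tokens(grid, s):
    """Coarse-to-fine packing: one output token per s x s region of the input grid."""
    h, w = len(grid), len(grid[0])
    assert h % s == 0 and w % s == 0, "grid size must be divisible by s"
    coarse = downsample(grid, s)
    packed = []
    for ci, row in enumerate(coarse):
        for cj, query in enumerate(row):
            region = [grid[ci * s + di][cj * s + dj]
                      for di in range(s) for dj in range(s)]
            packed.append(enrich(query, region))
    return packed

def token_reduction(side, s):
    """Fraction of tokens removed when a side x side grid is packed by factor s."""
    return 1 - (side // s) ** 2 / side ** 2

# Assumed 24 x 24 token grid with tiny 4-dimensional features.
side, dim = 24, 4
grid = [[[float((i * side + j + k) % 7) for k in range(dim)]
         for j in range(side)] for i in range(side)]
for s in (2, 3):
    packed = pack_tokens(grid, s)
    print(s, len(packed), round(100 * token_reduction(side, s)))
# prints "2 144 75" then "3 64 89"
```

Under these assumptions, a downsampling factor of 2 keeps 144 of 576 tokens (a 75% reduction) and a factor of 3 keeps 64 (about 89%), which is consistent with the range quoted in the summary. Restricting each coarse token to its own region keeps the attention cost proportional to the number of input tokens rather than to the product of input and output tokens.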