5 Mar 2024 | Jianjian Cao¹, Peng Ye¹, Shengze Li¹, Chong Yu², Yansong Tang³, Jiwen Lu³, Tao Chen¹†
The paper introduces MADTP (Multimodal Alignment-Guided Dynamic Token Pruning), a novel framework designed to accelerate Vision-Language Transformers (VLTs) by reducing their computational cost. VLTs, which jointly process visual and language modalities, have achieved great success but are computationally expensive due to the large number of tokens they must process. Existing token pruning methods often ignore the alignment between modalities, which can lead to important tokens being pruned by mistake. They also lack the flexibility to adjust the compression ratio dynamically for different input samples.
MADTP addresses these issues with two key components: the Multi-modality Alignment Guidance (MAG) module and the Dynamic Token Pruning (DTP) module. The MAG module aligns features across modalities so that only tokens of low importance to both modalities are pruned. The DTP module dynamically adjusts the compression ratio of each layer according to the complexity of the input instance.
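To make the idea concrete, here is a minimal PyTorch sketch of alignment-guided, input-dependent token pruning. It is not the authors' implementation: the function `alignment_guided_prune`, the cosine-similarity score, and the fixed `threshold` are illustrative assumptions, whereas MADTP learns its cross-modal alignment through the MAG module and its per-layer ratios through the DTP module.

```python
import torch
import torch.nn.functional as F

def alignment_guided_prune(vision_tokens, text_feat, threshold=0.5):
    """Score each visual token by its alignment with the pooled text
    feature and return a boolean keep-mask over tokens.

    vision_tokens: (B, N, D) visual token features from one VLT layer
    text_feat:     (B, D)    pooled text feature (e.g., the [CLS] token)
    threshold:     score in [0, 1]; tokens scoring below it are pruned
    """
    v = F.normalize(vision_tokens, dim=-1)            # (B, N, D)
    t = F.normalize(text_feat, dim=-1).unsqueeze(-1)  # (B, D, 1)
    sim = torch.bmm(v, t).squeeze(-1)                 # cosine similarity, (B, N)
    score = (sim + 1.0) / 2.0                         # rescale to [0, 1]
    keep = score >= threshold                         # (B, N) boolean mask
    # Guarantee at least one surviving token per sample.
    keep.scatter_(1, score.argmax(dim=1, keepdim=True), True)
    return keep, score

# The mask is input-dependent: the number of tokens kept varies per sample,
# in contrast to fixed-ratio pruning schedules.
B, N, D = 2, 16, 64
keep, _ = alignment_guided_prune(torch.randn(B, N, D), torch.randn(B, D))
print(keep.sum(dim=1))  # tokens retained per sample
```

The key design point the sketch illustrates is that the pruning decision consults the other modality (the text feature) rather than a purely visual saliency score, and that the surviving token count follows from the scores rather than from a preset ratio.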
Experiments on various benchmarks demonstrate that MADTP significantly reduces computational complexity while maintaining competitive performance. Notably, when applied to the BLIP model on the NLVR2 dataset, MADTP reduces GFLOPs by 80% with less than 4% performance degradation. The code for MADTP is available at <https://github.com/double125/MADTP>.