MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer


5 Mar 2024 | Jianjian Cao, Peng Ye, Shengze Li, Chong Yu, Yansong Tang, Jiwen Lu, Tao Chen
This paper proposes a novel framework, Multimodal Alignment-Guided Dynamic Token Pruning (MADTP), to accelerate Vision-Language Transformers (VLTs). VLTs have achieved great success on a wide range of multimodal tasks, but they suffer from high computational costs caused by the large number of visual and language tokens. Existing token pruning methods for VLTs cannot dynamically adjust the compression of each layer to different input samples, and they often ignore the critical role of cross-modal alignment in guiding which tokens to prune.

The MADTP framework consists of two main components. The Multi-modality Alignment Guidance (MAG) module is placed between the vision and language branches of a VLT; it uses learnable tokens to align features of the same semantic concept across modalities, ensuring that a pruned token is unimportant for all modalities and providing explicit guidance for pruning. The Dynamic Token Pruning (DTP) module is incorporated into each transformer block and adaptively adjusts the token compression ratio of each layer based on the complexity of the input instance and the learned alignment guidance.

Extensive experiments on four multimodal benchmarks (NLVR2, COCO, Flickr30k, and VQA v2.0) show that MADTP significantly reduces the computational complexity of various multimodal models while preserving competitive performance, achieving new state-of-the-art results. Notably, when applied to the BLIP model on the NLVR2 dataset, MADTP reduces GFLOPs by 80% with less than 4% performance degradation. The framework is also effective for image captioning and visual question answering, demonstrating its versatility in accelerating VLTs.
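To make the idea more concrete, the short PyTorch sketch below illustrates one plausible way to combine alignment-guided importance scoring with instance-dependent pruning. The class name, the use of a handful of learnable alignment tokens, the attention-based scoring, and the keep_floor mean-threshold rule are illustrative assumptions for this sketch only; they are not the authors' MAG or DTP implementations, which are defined in the paper and learn the per-layer ratio rather than deriving it from a fixed rule.

import torch
import torch.nn as nn


class AlignmentGuidedPruning(nn.Module):
    """Minimal sketch of alignment-guided dynamic token pruning.

    A small set of learnable alignment tokens attends over the visual and
    textual token sequences; the attention mass each token receives is used
    as a cross-modal importance score, and low-scoring tokens are dropped.
    Names and the thresholding rule are illustrative assumptions, not the
    MADTP implementation.
    """

    def __init__(self, dim: int, num_align_tokens: int = 4, keep_floor: float = 0.2):
        super().__init__()
        self.align_tokens = nn.Parameter(torch.randn(num_align_tokens, dim) * 0.02)
        self.scale = dim ** -0.5
        self.keep_floor = keep_floor  # never keep fewer than this fraction of tokens

    def importance(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) -> (B, N) importance scores from the attention of
        # the shared alignment tokens onto this modality's token sequence.
        attn = torch.softmax(self.align_tokens @ tokens.transpose(1, 2) * self.scale, dim=-1)
        return attn.mean(dim=1)  # average over the alignment tokens

    def _prune(self, tokens: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
        # Instance-dependent ratio: keep tokens scoring above the per-sample
        # mean, but at least keep_floor * N of them (a simple stand-in for the
        # learned, layer-wise compression ratio described in the paper).
        B, N, D = tokens.shape
        k = max(int(self.keep_floor * N),
                int((scores > scores.mean(dim=1, keepdim=True)).sum(dim=1).max()))
        idx = scores.topk(k, dim=1).indices  # (B, k) indices of kept tokens
        return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, D))

    def forward(self, vis: torch.Tensor, txt: torch.Tensor):
        # Score and prune each modality with the same alignment tokens, so a
        # token survives only if it matters under the shared cross-modal guidance.
        return self._prune(vis, self.importance(vis)), self._prune(txt, self.importance(txt))


# Toy usage: a batch of two image/text pairs with ViT-like and BERT-like tokens.
pruner = AlignmentGuidedPruning(dim=768)
vis, txt = torch.randn(2, 197, 768), torch.randn(2, 40, 768)
vis_kept, txt_kept = pruner(vis, txt)  # kept sequence lengths depend on the input

In this sketch the number of retained tokens changes with the attention statistics of each input, which mirrors the instance-dependent, layer-wise compression described above, while the shared alignment tokens ensure that a token is only pruned when it is unimportant under the cross-modal guidance.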