30 Mar 2024 | Joonmyung Choi*, Sanghyeok Lee*, Jaewon Chu, Minhyuk Choi, Hyunwoo J. Kim†
The paper introduces vid-TLDR, a training-free token merging method for lightweight video Transformers. vid-TLDR reduces computational cost by merging uninformative background tokens, aiming to match or even improve the base model's performance without any additional training. The key contributions of vid-TLDR are:
1. **Saliency Detection via Attention Sharpness**: vid-TLDR uses a sharpness function over the attention scores to detect salient regions in videos, even from the first layer of the Transformer. This helps identify foreground objects more accurately (a minimal sketch follows this list).
2. **Saliency-Aware Token Merging**: vid-TLDR introduces a saliency-aware token merging strategy that drops background tokens and adjusts the informativeness of foreground tokens. This minimizes the impact of irrelevant tokens and improves the efficiency of the Transformer (see the second sketch below).
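To make the saliency step concrete, here is a minimal PyTorch sketch (not the official implementation) of scoring tokens by attention sharpness. The function name `attention_sharpness` and the negative-entropy formulation are illustrative assumptions; the paper defines its own sharpness function over the attention map.

```python
import torch


def attention_sharpness(attn: torch.Tensor) -> torch.Tensor:
    """Score each token by how sharply it is attended to.

    attn: (B, H, N, N) softmax attention map from an early Transformer layer.
    Returns (B, N) saliency scores; tokens receiving focused (low-entropy)
    attention score higher than diffusely attended background tokens.
    """
    p = attn.mean(dim=1)                                  # (B, N, N), head-averaged
    p = p / p.sum(dim=1, keepdim=True).clamp_min(1e-6)    # normalize attention each token receives
    # Negative entropy of the attention received by each token: one plausible
    # sharpness proxy (assumption; the paper's exact formulation may differ).
    return (p * p.clamp_min(1e-6).log()).sum(dim=1)       # (B, N)
```

Because the scores are computed directly from attention maps, they are available from the very first layer without any fine-tuning, which is what makes the approach training-free.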
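Likewise, the saliency-aware merging step can be sketched as dropping the lowest-saliency tokens and folding them into their most similar kept tokens with saliency-derived weights. This is a hedged, ToMe-style approximation for illustration only; `saliency_aware_merge`, the drop ratio, and the cosine-similarity matching are assumptions rather than the paper's exact procedure.

```python
import torch
import torch.nn.functional as F


def saliency_aware_merge(x: torch.Tensor, saliency: torch.Tensor,
                         drop_ratio: float = 0.3) -> torch.Tensor:
    """Drop the least salient tokens and fold them into similar kept tokens.

    x: (B, N, C) token features; saliency: (B, N) scores (e.g. from the
    sharpness sketch above). Returns (B, N_keep, C) merged tokens.
    """
    B, N, C = x.shape
    n_keep = max(1, int(N * (1.0 - drop_ratio)))
    order = saliency.argsort(dim=1, descending=True)
    keep_idx, drop_idx = order[:, :n_keep], order[:, n_keep:]

    kept = x.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, C))
    dropped = x.gather(1, drop_idx.unsqueeze(-1).expand(-1, -1, C))

    # Saliency-derived weights so background tokens contribute little.
    w = saliency.softmax(dim=1).unsqueeze(-1)             # (B, N, 1)
    w_keep = w.gather(1, keep_idx.unsqueeze(-1))          # (B, N_keep, 1)
    w_drop = w.gather(1, drop_idx.unsqueeze(-1))          # (B, N_drop, 1)

    # Route each dropped token to its most similar kept token (cosine similarity).
    sim = F.normalize(dropped, dim=-1) @ F.normalize(kept, dim=-1).transpose(1, 2)
    assign = sim.argmax(dim=-1)                           # (B, N_drop)

    # Saliency-weighted average of each kept token and the tokens merged into it.
    merged = (kept * w_keep).scatter_add(
        1, assign.unsqueeze(-1).expand(-1, -1, C), dropped * w_drop)
    norm = w_keep.scatter_add(1, assign.unsqueeze(-1), w_drop)
    return merged / norm.clamp_min(1e-6)
```

Weighting the average by saliency is what "adjusts the informativeness" of the surviving tokens in this sketch: a foreground token absorbs background tokens without being diluted by them.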
Experiments show that vid-TLDR significantly reduces computational complexity while maintaining or improving performance compared to the base model, UMT. The method is evaluated on various video tasks, including video-text retrieval and video question answering, demonstrating competitive or superior performance with reduced FLOPs. The code for vid-TLDR is available at <https://github.com/mlvlab/vid-TLDR>.