vid-TLDR: Training Free Token merging for Light-weight Video Transformer

30 Mar 2024 | Joonmyung Choi, Sanghyeok Lee, Jaewon Chu, Minhyuk Choi, Hyunwoo J. Kim
vid-TLDR is a training-free token merging method that improves the efficiency of video transformers by merging background tokens without any additional training. It uses attention maps to detect salient regions in videos and introduces a saliency-aware merging strategy that suppresses irrelevant tokens: salient regions are first identified via attention sharpness, and tokens are then merged according to their saliency scores.

Experiments show that vid-TLDR significantly reduces computational complexity while achieving competitive performance compared to the base model. On video-text retrieval and video question answering, it improves performance by +0.8%, +0.5%, and +1.1% on MSRVTT, MSVD, and DiDeMo respectively, with FLOP reductions of at least 39.5%. vid-TLDR also performs strongly on action recognition and base-to-novel generalization, and ablation studies confirm that the merging strategy both reduces computational cost and improves accuracy. These results make vid-TLDR a promising approach for lightweight video transformers.
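The two-step idea described above (score each token by the attention it receives, then fold the lowest-saliency tokens into their most similar kept token with saliency-weighted averaging) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: vid-TLDR derives saliency from attention sharpness and uses ToMe-style bipartite soft matching inside each transformer block, whereas `saliency_scores` and `merge_tokens` below are simplified stand-ins using mean received attention and a cosine nearest-neighbor assignment.

```python
import numpy as np

def saliency_scores(attn: np.ndarray) -> np.ndarray:
    """attn: (heads, tokens, tokens) attention map.

    Simplified saliency: the average attention a token receives,
    normalized to sum to 1 (stand-in for the paper's sharpness score).
    """
    a = attn.mean(axis=0)           # average over heads -> (tokens, tokens)
    received = a.mean(axis=0)       # attention received by each token
    return received / received.sum()

def merge_tokens(x: np.ndarray, sal: np.ndarray, r: int) -> np.ndarray:
    """Merge the r least-salient tokens of x (tokens, dim) into their
    nearest kept neighbor, weighting by saliency so that background
    tokens contribute less to the merged representation."""
    n = x.shape[0]
    order = np.argsort(sal)                      # ascending saliency
    drop = set(order[:r].tolist())               # low-saliency tokens to merge away
    keep = [i for i in range(n) if i not in drop]

    xn = x / np.linalg.norm(x, axis=1, keepdims=True)  # unit vectors for cosine sim
    # accumulate (saliency-weighted sum, total weight) per kept token
    acc = {i: (x[i] * sal[i], sal[i]) for i in keep}
    for i in drop:
        sims = xn[keep] @ xn[i]                  # cosine similarity to kept tokens
        j = keep[int(np.argmax(sims))]           # most similar kept token
        v, w = acc[j]
        acc[j] = (v + x[i] * sal[i], w + sal[i])
    return np.stack([v / w for v, w in acc.values()])
```

Because the saliency scores also act as merge weights, a background token absorbed into a salient one barely shifts the result, which is how this scheme suppresses irrelevant content without retraining.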