2024 | Wangbo Zhao, Jiasheng Tang, Yizeng Han, Yibing Song, Fan Wang, Yang You, Kai Wang
This paper proposes Dynamic Tuning (DyT), a novel approach that improves both parameter and inference efficiency for Vision Transformer (ViT) adaptation. Existing parameter-efficient fine-tuning (PEFT) methods have achieved significant success in adapting ViTs by reducing the number of tunable parameters, but they have not effectively addressed inference efficiency, which is crucial for deploying ViTs on computationally intensive tasks. DyT introduces a token dispatcher that distinguishes informative tokens from less important ones, allowing the latter to dynamically skip the original block during inference and thereby reducing redundant computation. Additionally, DyT explores multiple design variants to identify best practices for efficient ViT adaptation. Inspired by the mixture-of-experts (MoE) mechanism, an enhanced adapter is introduced to further boost adaptation performance. DyT is validated across various tasks, including image/video recognition and semantic segmentation. For instance, on the VTAB-1K benchmark, DyT achieves superior performance compared to existing PEFT methods while incurring only 71% of their FLOPs. These results show that DyT is efficient in both parameters and inference across various visual tasks, offering a promising direction for efficient model adaptation.
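The token-dispatching idea described above can be illustrated with a minimal sketch. This is not the paper's implementation; the function names, the sigmoid scorer, and the fixed threshold are all illustrative assumptions. The point is simply that a lightweight per-token gate decides which tokens pay for the expensive block, while the rest take an identity shortcut:

```python
import numpy as np

# Hypothetical sketch of DyT-style token dispatching (names and the scoring
# rule are illustrative, not the paper's actual design). A tiny linear scorer
# assigns each token an importance score; tokens below a threshold skip the
# expensive block and pass through unchanged.

def token_dispatch(tokens, scorer_w, block, threshold=0.5):
    """tokens: (N, D) array; scorer_w: (D,) scoring vector;
    block: callable mapping (M, D) -> (M, D)."""
    scores = 1.0 / (1.0 + np.exp(-(tokens @ scorer_w)))  # sigmoid gate per token
    keep = scores >= threshold                           # informative tokens
    out = tokens.copy()                                  # skipped tokens: identity path
    if keep.any():
        out[keep] = block(tokens[keep])                  # only kept tokens pay FLOPs
    return out, keep

# Usage: a toy "block" that doubles its input; only gated-in tokens change.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))
out, keep = token_dispatch(x, scorer_w=np.ones(4), block=lambda t: 2.0 * t)
```

At inference the compute saving comes from calling `block` on only the kept subset; during training the real method would need a differentiable gating scheme (e.g. a straight-through estimator), which this sketch omits.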
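The MoE-inspired adapter can likewise be sketched in a few lines. Again, this is a hedged illustration, not the paper's adapter: the expert count, the low-rank down/up projection shape, and the softmax router are assumptions that merely echo the mixture-of-experts idea of mixing several small adapter experts per token:

```python
import numpy as np

# Hypothetical sketch of an MoE-style adapter (illustrative, not the paper's
# implementation): several low-rank adapter "experts" whose outputs are mixed
# per token by a softmax router, added back residually.

def moe_adapter(x, down_ws, up_ws, router_w):
    """x: (N, D); down_ws/up_ws: lists of (D, r) / (r, D) expert weights;
    router_w: (D, E) routing matrix for E experts."""
    logits = x @ router_w                                     # (N, E) routing scores
    gates = np.exp(logits - logits.max(axis=-1, keepdims=True))
    gates = gates / gates.sum(axis=-1, keepdims=True)         # softmax over experts
    expert_outs = np.stack(
        [np.maximum(x @ dw, 0.0) @ uw for dw, uw in zip(down_ws, up_ws)],
        axis=1,
    )                                                         # (N, E, D) per-expert outputs
    return x + (gates[..., None] * expert_outs).sum(axis=1)   # residual adapter output

# Usage with toy dimensions: 3 experts of rank 2 on 8-dim tokens.
rng = np.random.default_rng(1)
D, r, E, N = 8, 2, 3, 5
x = rng.standard_normal((N, D))
down = [rng.standard_normal((D, r)) for _ in range(E)]
up = [rng.standard_normal((r, D)) for _ in range(E)]
y = moe_adapter(x, down, up, rng.standard_normal((D, E)))
```

Because each expert is low-rank, the added parameter count stays small, which is the property PEFT methods care about.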