SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design


27 Mar 2024 | Seokju Yun, Youngmin Ro*
**Authors:** Seokju Yun, Youngmin Ro
**Affiliation:** Machine Intelligence Laboratory, University of Seoul, Korea
**GitHub:** <https://github.com/ysj9909/SHViT>

**Abstract:** This paper tackles computational redundancy in Vision Transformers (ViTs) at both the macro and micro levels, targeting state-of-the-art speed-accuracy trade-offs on resource-constrained devices. At the macro level, the authors propose a 16×16 patchify stem and a 3-stage design that reduce memory access costs and operate on token representations with low spatial redundancy. At the micro level, they introduce a Single-Head Self-Attention (SHSA) module that avoids head redundancy and boosts accuracy by combining global and local information. The resulting SHViT models achieve superior speed and accuracy on ImageNet-1k classification, object detection, and instance segmentation.

**Key Contributions:**
1. **Efficient Macro Design:** A 16×16 patchify stem and a 3-stage design reduce memory access costs and enable effective representation learning.
2. **Single-Head Self-Attention (SHSA):** Removes head redundancy and improves efficiency by applying attention to only a subset of the input channels (both ideas are sketched in the code below).
3. **SHViT Model:** Achieves state-of-the-art speed-accuracy trade-offs across a range of devices and tasks.
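The two contributions above are compact enough to sketch in code. The following is a minimal PyTorch sketch, not the authors' reference implementation: the class names, channel split, and hyperparameters (`dim`, `pdim`, `qk_dim`) are illustrative assumptions. It shows a 16×16 patchify stem built from four stride-2 convolutions, and a single-head attention block that attends over only the first `pdim` channels while the remaining channels bypass attention.

```python
# Minimal PyTorch sketch of SHViT's macro design and SHSA module.
# Class names and hyperparameters (dim, pdim, qk_dim) are illustrative
# assumptions, not values from the paper's reference implementation.
import torch
import torch.nn as nn


class PatchifyStem(nn.Module):
    """16x16 patchify stem: four stride-2 3x3 convs give an overall
    stride of 16, so stage 1 already operates on 14x14 tokens for a
    224x224 input (low spatial redundancy, low memory access cost)."""

    def __init__(self, in_ch: int = 3, dim: int = 128):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, dim // 8, 3, 2, 1), nn.ReLU(),
            nn.Conv2d(dim // 8, dim // 4, 3, 2, 1), nn.ReLU(),
            nn.Conv2d(dim // 4, dim // 2, 3, 2, 1), nn.ReLU(),
            nn.Conv2d(dim // 2, dim, 3, 2, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.stem(x)  # (B, 3, 224, 224) -> (B, dim, 14, 14)


class SHSA(nn.Module):
    """Single-Head Self-Attention over a channel subset: only the first
    `pdim` channels are attended to with a single head; the remaining
    channels bypass attention and are fused back by a 1x1 projection."""

    def __init__(self, dim: int, pdim: int, qk_dim: int = 16):
        super().__init__()
        self.pdim, self.qk_dim = pdim, qk_dim
        self.scale = qk_dim ** -0.5
        self.norm = nn.GroupNorm(1, pdim)   # normalize the attended split only
        self.qkv = nn.Conv2d(pdim, 2 * qk_dim + pdim, 1)
        self.proj = nn.Conv2d(dim, dim, 1)  # fuse attended + bypassed channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        x1, x2 = torch.split(x, [self.pdim, c - self.pdim], dim=1)
        q, k, v = self.qkv(self.norm(x1)).split(
            [self.qk_dim, self.qk_dim, self.pdim], dim=1)
        q, k, v = q.flatten(2), k.flatten(2), v.flatten(2)  # (B, C', N)
        attn = (q.transpose(1, 2) @ k) * self.scale         # (B, N, N)
        attn = attn.softmax(dim=-1)
        out = (v @ attn.transpose(1, 2)).reshape(b, self.pdim, h, w)
        return self.proj(torch.cat([out, x2], dim=1))


# Usage: stem tokens feed the attention block directly.
x = torch.randn(1, 3, 224, 224)
feats = SHSA(dim=128, pdim=32)(PatchifyStem(dim=128)(x))
print(feats.shape)  # torch.Size([1, 128, 14, 14])
```

Attending over only a fraction of the channels is what removes the per-head redundancy of multi-head attention while keeping a global receptive field; a full 3-stage model would stack such blocks at strides 16, 32, and 64.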
**Experiments:**
- **ImageNet-1k Classification:** SHViT-S4 achieves 79.4% top-1 accuracy at 14,283 images/s on an Nvidia A100 GPU and 509 images/s on an Intel Xeon Gold 5218R CPU, outperforming EfficientNet-B0 by 2.3% in top-1 accuracy while running 69.4% faster on the GPU.
- **Object Detection and Instance Segmentation:** SHViT-S4 outperforms EfficientViT-M5 (512×512 input) and EfficientNet backbones in both speed and accuracy, with especially large gains on mobile devices.

**Conclusion:** SHViT removes computational redundancies in ViTs at both the macro and micro levels, delivering efficient and accurate performance across diverse devices and tasks. Future work will focus on integrating the single-head design into existing attention methods and on enhancing performance with fine-grained features.