SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design


27 Mar 2024 | Seokju Yun, Youngmin Ro
This paper proposes SHViT, a single-head vision transformer that achieves a state-of-the-art speed-accuracy tradeoff by addressing computational redundancy at all design levels through memory-efficient macro and micro design.

Macro design: a larger-stride patchify stem and a 3-stage structure reduce memory access costs while maintaining competitive performance. The architecture consists of a 16×16 overlapping patch embedding layer followed by three stages of SHViT blocks.

Micro design: multi-head attention is replaced with single-head attention, which eliminates head redundancy and improves accuracy by combining global and local information. The single-head self-attention layer models global context while processing only a subset of the input channels, reducing computational redundancy and yielding significant gains in both speed and accuracy. Ablations show that this single-head design has a larger impact on the speed-accuracy tradeoff than efficient attention variants or simple operations such as pooling.

Results: On ImageNet-1K, SHViT achieves 1.3% higher accuracy than MobileViTv2 while being 3.3× faster on GPU, 8.1× faster on CPU, and 2.4× faster on an iPhone 12. SHViT-S4 reaches 79.4% top-1 accuracy with a throughput of 14,283 images/s on an A100 GPU and 509 images/s on an Intel CPU. On COCO, SHViT matches FastViT-SA12 with 3.8× and 2.0× lower backbone latency on GPU and mobile devices, respectively. Together, the memory-efficient macro design and single-head attention module enable efficient inference across diverse devices, including mobile platforms, on classification, detection, and segmentation tasks.
The model's performance is validated on multiple tasks, including object detection and instance segmentation on COCO.