MOBILEViT: LIGHT-WEIGHT, GENERAL-PURPOSE, AND MOBILE-FRIENDLY VISION TRANSFORMER

4 Mar 2022 | Sachin Mehta, Mohammad Rastegari
**Abstract:** Light-weight convolutional neural networks (CNNs) are widely used for mobile vision tasks because their spatial inductive biases let them learn representations with fewer parameters across different vision tasks. However, these networks are spatially local. Self-attention-based vision transformers (ViTs) can learn global representations, but they are heavy-weight. This paper introduces MobileViT, a light-weight and general-purpose vision transformer for mobile devices. MobileViT combines the strengths of CNNs and ViTs by effectively encoding both local and global information in a tensor. The results show that MobileViT significantly outperforms CNN- and ViT-based networks across different tasks and datasets. On the ImageNet-1k dataset, MobileViT achieves a top-1 accuracy of 78.4% with about 6 million parameters, which is 3.2% and 6.2% more accurate than MobileNetv3 (CNN-based) and DeiT (ViT-based), respectively, for a similar number of parameters. On the MS-COCO object detection task, MobileViT is 5.7% more accurate than MobileNetv3 for a similar number of parameters.

**Introduction:** Self-attention-based models, especially ViTs, have emerged as an alternative to CNNs for learning visual representations. However, ViTs are heavy-weight and exhibit sub-standard optimizability. This paper introduces MobileViT, a light-weight ViT model that combines the benefits of CNNs and ViTs. MobileViT uses standard convolutions and transformers to learn local and global representations, respectively. The MobileViT block encodes both local and global information in an input tensor with fewer parameters. Unlike ViTs, MobileViT retains the spatial order of pixels and patches, allowing it to learn global representations while keeping spatial inductive biases. This design results in a light-weight and general-purpose network that is easy to optimize and to integrate with downstream architectures.

**Related Work:** Light-weight CNNs power many mobile vision tasks thanks to their versatility and ease of training, but they are spatially local. Vision transformers show superior performance on large-scale datasets but require extensive data augmentation and are difficult to train. Previous works have explored hybrid approaches that combine convolutions and transformers to improve the stability and performance of ViTs; however, these models remain heavy-weight and sensitive to hyper-parameters.

**MobileViT Architecture:** MobileViT's overall architecture is inspired by light-weight CNNs. It uses standard convolutions and transformers to learn local and global representations, respectively. The MobileViT block unfolds a convolutional feature map into patches, applies self-attention across those patches, and folds the result back before fusing it with the input, so both local and global information are encoded with few parameters; a minimal sketch of this block is given below. The theoretical computational cost of MobileViT is similar to that of ViTs, but it is more efficient in practice due to its convolution-like properties. MobileViT is trained at three network sizes: small, extra small, and extra extra small.
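To make the block's "local, then global" encoding concrete, here is a minimal PyTorch-style sketch of a MobileViT-like block. It is not the authors' implementation; the channel widths, 2×2 patch size, transformer depth, and head count are illustrative assumptions. The sketch applies a 3×3 convolution for local features, unfolds the feature map into non-overlapping patches, runs self-attention across patches for each pixel position, folds the result back into its spatial layout, and fuses it with the original input tensor.

```python
# Minimal sketch of a MobileViT-style block (not the authors' code).
# Channel widths, patch size, depth, and head count are illustrative assumptions.
import torch
import torch.nn as nn


class MobileViTBlockSketch(nn.Module):
    def __init__(self, channels=64, d_model=96, patch=2, depth=2, heads=4):
        super().__init__()
        self.patch = patch
        # Local representation: 3x3 conv, then 1x1 projection to d_model.
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, d_model, 1),
        )
        # Global representation: transformer layers over unfolded patches.
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=heads,
            dim_feedforward=2 * d_model, batch_first=True,
        )
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        # Fusion: project back to input channels, concat with input, 3x3 conv.
        self.proj = nn.Conv2d(d_model, channels, 1)
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, x):
        B, C, H, W = x.shape
        p = self.patch
        y = self.local(x)                                  # (B, d, H, W)
        d = y.shape[1]
        # Unfold into non-overlapping p x p patches: (B*p*p, N, d),
        # where N = (H/p)*(W/p) is the number of patches.
        y = y.reshape(B, d, H // p, p, W // p, p)
        y = y.permute(0, 3, 5, 2, 4, 1)
        y = y.reshape(B * p * p, (H // p) * (W // p), d)
        # Self-attention relates each pixel position across all patches.
        y = self.transformer(y)
        # Fold back to the original spatial layout: (B, d, H, W).
        y = y.reshape(B, p, p, H // p, W // p, d)
        y = y.permute(0, 5, 3, 1, 4, 2).reshape(B, d, H, W)
        # Fuse the global features with the original input tensor.
        y = self.proj(y)
        return self.fuse(torch.cat([x, y], dim=1))


if __name__ == "__main__":
    block = MobileViTBlockSketch()
    out = block(torch.randn(1, 64, 32, 32))
    print(out.shape)  # torch.Size([1, 64, 32, 32])
```

Because the unfold and fold steps are exact inverses, the block preserves the spatial order of pixels and patches, which is what lets the self-attention step add global context while keeping convolution-like inductive biases.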