MOBILEViT: LIGHT-WEIGHT, GENERAL-PURPOSE, AND MOBILE-FRIENDLY VISION TRANSFORMER

4 Mar 2022 | Sachin Mehta, Mohammad Rastegari
MobileViT is a lightweight, general-purpose, and mobile-friendly vision transformer designed for mobile vision tasks. It combines the strengths of convolutional neural networks (CNNs) and vision transformers (ViTs) to achieve better performance and efficiency. MobileViT introduces a novel block that encodes both local and global information in a tensor, enabling it to learn representations with fewer parameters and a simpler training recipe.

Compared to CNN-based models like MobileNetv3 and ViT-based models like DeiT, MobileViT achieves higher accuracy for a similar number of parameters. On the ImageNet-1k dataset, it reaches a top-1 accuracy of 78.4% with about 6 million parameters, outperforming MobileNetv3 by 3.2% and DeiT by 6.2%. On the MS-COCO object detection task, MobileViT is 5.7% more accurate than MobileNetv3 for a similar number of parameters. It is also more efficient than ViT-based models such as PiT, achieving 1.8% better ImageNet-1k accuracy with 2× fewer FLOPs. Designed to be lightweight, general-purpose, and mobile-friendly, MobileViT is suitable for a wide range of mobile vision tasks and runs efficiently on mobile devices, with inference times under 33 ms for real-time performance. The source code is open-source and available at https://github.com/apple/ml-cvnets.
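The core idea of the MobileViT block is to unfold a convolutional feature map into non-overlapping patches, let pixels at the same position across all patches exchange information globally (where the paper applies transformer self-attention), and then fold the result back into a feature map. The sketch below illustrates only this unfold/mix/fold shape manipulation in NumPy; the residual mean-mix is a hypothetical stand-in for the actual transformer layers, and all function names are illustrative, not from the authors' code.

```python
import numpy as np

def unfold(x, h, w):
    # x: (C, H, W) -> (P, N, C), where P = h*w pixels per patch
    # and N = number of patches; H, W must be divisible by h, w.
    C, H, W = x.shape
    x = x.reshape(C, H // h, h, W // w, w)   # split H and W into patch grids
    x = x.transpose(2, 4, 1, 3, 0)           # (h, w, H/h, W/w, C)
    return x.reshape(h * w, (H // h) * (W // w), C)

def fold(x, C, H, W, h, w):
    # inverse of unfold: (P, N, C) -> (C, H, W)
    x = x.reshape(h, w, H // h, W // w, C)
    x = x.transpose(4, 2, 0, 3, 1)           # (C, H/h, h, W/w, w)
    return x.reshape(C, H, W)

def mobilevit_global_mix(x, h=2, w=2):
    # Placeholder for the transformer step: each pixel interacts with
    # pixels at the same patch position in every other patch. Here a
    # residual mean over patches stands in for self-attention.
    C, H, W = x.shape
    p = unfold(x, h, w)                      # (P, N, C)
    p = p + p.mean(axis=1, keepdims=True)    # global information flow
    return fold(p, C, H, W, h, w)

x = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
y = mobilevit_global_mix(x)
assert y.shape == x.shape                    # block preserves the tensor shape
```

Because unfold and fold are exact inverses, every pixel keeps its spatial location; this is what lets the block combine convolutional local structure with patch-level global mixing while staying shape-compatible with the rest of a CNN backbone.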