2024 | Xu Ma¹, Xiyang Dai², Jianwei Yang², Bin Xiao², Yinpeng Chen², Yun Fu¹, Lu Yuan²
EfficientMod is a novel design for efficient vision networks that improves the trade-off between accuracy and efficiency. At its core is the modulation mechanism, which pairs convolutional context modeling with feature projection layers and fuses their outputs by element-wise multiplication. We tailor this mechanism into the EfficientMod block, the essential building block of our networks: it modulates projected features through a simple context-modeling design and achieves better performance than existing methods at comparable cost.

The EfficientMod block is a unified design that inherits favorable properties from both convolution and attention. It simultaneously extracts spatial context and projects the input features, then fuses the two with element-wise multiplication, yielding strong representational ability with an efficient, purely convolutional implementation. Because the block is orthogonal to the vanilla self-attention block, the two can be combined into a hybrid architecture that further improves performance without sacrificing efficiency.

Comprehensive experiments show that EfficientMod is both effective and efficient, attaining state-of-the-art results among efficient networks. EfficientMod-s outperforms EfficientFormerV2-s2 by 0.6 top-1 accuracy while being 25% faster on GPU, and outperforms MobileViTv2-1.0 by 2.9 top-1 accuracy at the same GPU latency. On downstream tasks, EfficientMod improves semantic segmentation on the ADE20K benchmark by 3.6 mIoU. These results suggest the method holds great promise for efficient applications. Regarding limitations and broader impacts, the main open questions are the scalability of such efficient designs and the room for further improvements in computational efficiency.
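To make the modulation mechanism concrete, below is a minimal PyTorch sketch of a modulation block in the spirit described above: a convolutional context branch gates a linearly projected copy of the input via element-wise multiplication. The class name, channel sizes, kernel size, and choice of activation are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ModulationBlockSketch(nn.Module):
    """Minimal sketch of the modulation idea: a convolutional context
    branch modulates a projected copy of the input by element-wise
    multiplication. Hyperparameters here are illustrative."""

    def __init__(self, dim: int, kernel_size: int = 7):
        super().__init__()
        # Context-modeling branch: pointwise projection followed by a
        # depthwise convolution that gathers local spatial context.
        self.ctx_proj = nn.Conv2d(dim, dim, kernel_size=1)
        self.ctx_conv = nn.Conv2d(dim, dim, kernel_size,
                                  padding=kernel_size // 2, groups=dim)
        self.act = nn.GELU()
        # Value branch: a plain linear projection of the input features.
        self.v_proj = nn.Conv2d(dim, dim, kernel_size=1)
        # Output projection applied after fusion.
        self.out_proj = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        ctx = self.act(self.ctx_conv(self.ctx_proj(x)))  # spatial context
        v = self.v_proj(x)                               # projected features
        return self.out_proj(ctx * v)                    # element-wise modulation

x = torch.randn(1, 64, 56, 56)
y = ModulationBlockSketch(64)(x)
assert y.shape == x.shape
```

Both branches run in parallel on the same input, so the block extracts spatial context and projects features simultaneously, and the multiplicative fusion gives it the input-dependent weighting that attention provides, at convolutional cost.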
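The hybrid architecture can be sketched in the same spirit. The stage layout below is a hypothetical illustration of the design principle (vanilla self-attention only at low-resolution stages, where its quadratic cost is affordable), reusing ModulationBlockSketch from the previous snippet; it is not the paper's exact architecture, and all depths and widths are assumptions.

```python
import torch
import torch.nn as nn

class AttentionBlockSketch(nn.Module):
    """Vanilla self-attention over flattened spatial tokens; a stand-in
    for the attention blocks used in the hybrid architecture."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)            # (B, H*W, C)
        t = self.norm(tokens)
        attn_out, _ = self.attn(t, t, t, need_weights=False)
        tokens = tokens + attn_out                       # residual connection
        return tokens.transpose(1, 2).reshape(b, c, h, w)

def make_stage(block_cls, dim: int, depth: int) -> nn.Sequential:
    return nn.Sequential(*[block_cls(dim) for _ in range(depth)])

# Hypothetical 4-stage hybrid: convolutional modulation blocks at the
# high-resolution early stages, self-attention only at the later stages.
stages = nn.ModuleList([
    make_stage(ModulationBlockSketch, 32, 2),    # stage 1: 1/4 resolution
    make_stage(ModulationBlockSketch, 64, 2),    # stage 2: 1/8 resolution
    make_stage(AttentionBlockSketch, 128, 4),    # stage 3: 1/16 resolution
    make_stage(AttentionBlockSketch, 256, 2),    # stage 4: 1/32 resolution
])
```

Because the modulation block is purely convolutional and orthogonal to self-attention, swapping attention blocks into the low-resolution stages adds global interaction where it is cheapest, which is the intuition behind the hybrid variant's gains.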