This paper proposes a novel attention mechanism called coordinate attention for efficient mobile network design. Unlike traditional channel attention mechanisms that use 2D global pooling to generate a single feature vector, coordinate attention factorizes the attention process into two 1D feature encoding steps that aggregate features along the horizontal and vertical directions. This allows the model to capture long-range dependencies in one spatial direction while preserving precise positional information in the other. The resulting attention maps are then applied to the input feature map to enhance the representation of the objects of interest.
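As a rough illustration of the factorized pooling step described above (function and variable names are illustrative, not taken from the paper's code), the two 1D averages over a (C, H, W) feature map could be sketched as:

```python
import numpy as np

def coordinate_pooling(x):
    """Factorize 2D global pooling into two 1D averages.

    x: feature map of shape (C, H, W).
    Returns z_h of shape (C, H), pooled along the horizontal direction,
    and z_w of shape (C, W), pooled along the vertical direction.
    """
    z_h = x.mean(axis=2)  # average each row: keeps vertical position
    z_w = x.mean(axis=1)  # average each column: keeps horizontal position
    return z_h, z_w

x = np.ones((2, 3, 5))        # toy feature map with 2 channels
z_h, z_w = coordinate_pooling(x)
```

Each 1D vector summarizes long-range context along one spatial axis while retaining position indices along the other, which is the property the paper contrasts with the single vector produced by 2D global pooling.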
The coordinate attention mechanism is simple and can be easily integrated into classic mobile networks such as MobileNetV2, MobileNeXt, and EfficientNet with minimal computational overhead. Extensive experiments show that coordinate attention not only improves ImageNet classification performance but also performs better in downstream tasks such as object detection and semantic segmentation compared to existing attention mechanisms like SE block and CBAM.
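A minimal sketch of how such a block could reweight features, assuming simple per-channel stand-in transforms (`w_h`, `w_w`) in place of the paper's shared 1x1 convolutions, and omitting the concatenation and channel-reduction details of the full module:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def coordinate_attention(x, w_h, w_w):
    # x: (C, H, W) feature map; w_h, w_w: (C, C) illustrative transforms.
    z_h = x.mean(axis=2)           # (C, H): pooled along the width
    z_w = x.mean(axis=1)           # (C, W): pooled along the height
    g_h = sigmoid(w_h @ z_h)       # per-(channel, row) attention weights
    g_w = sigmoid(w_w @ z_w)       # per-(channel, column) attention weights
    # Reweight each position by its row and column attention values.
    return x * g_h[:, :, None] * g_w[:, None, :]

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3, 5))
y = coordinate_attention(x, np.eye(4), np.eye(4))
```

Because the attention maps are multiplied back onto the input, the block preserves the feature map's shape, which is why it can be dropped into existing mobile blocks without changing the surrounding architecture.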
The proposed method captures both cross-channel and direction-aware information, which helps models locate and recognize objects more accurately. It is flexible and lightweight, and can be easily plugged into various mobile network components. When used in a pretrained backbone, coordinate attention can significantly improve the performance of downstream tasks, especially dense-prediction tasks such as semantic segmentation.
The experiments demonstrate that coordinate attention achieves a 0.8% improvement in ImageNet classification accuracy with comparable parameters and computation. In object detection and semantic segmentation, it outperforms alternatives such as the SE block and CBAM. The method is also robust to different reduction ratios and remains effective in stronger mobile networks such as EfficientNet.
The coordinate attention mechanism is applied to both object detection and semantic segmentation tasks, showing its transferability across different vision tasks. In object detection, it improves detection results on COCO and Pascal VOC datasets. In semantic segmentation, it achieves better performance on Pascal VOC 2012 and Cityscapes datasets compared to other attention mechanisms. The results show that coordinate attention is particularly effective in tasks requiring precise spatial information, such as semantic segmentation.