2 Jun 2016 | Liang-Chieh Chen, Yi Yang, Jiang Wang, Wei Xu, Alan L. Yuille
This paper introduces an attention-based approach to semantic image segmentation that learns to weight multi-scale features at each pixel location. The attention model is trained jointly with a state-of-the-art semantic segmentation network that takes multi-scale input images. The attention component not only provides non-trivial improvements over average- and max-pooling baselines, but also allows diagnostic visualization of feature importance at each position and scale. The authors further add extra supervision to the output at each scale, which proves essential for good performance when merging multi-scale features. Because the model is trained end-to-end, the weights over scales are learned adaptively, without requiring tedious annotations of the "ground truth scale" for each pixel. The method is evaluated on three challenging datasets: PASCAL-Person-Part, PASCAL VOC 2012, and a subset of MS-COCO 2014, where it consistently improves over strong baselines.
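As a rough illustration of the mechanism summarized above, the sketch below merges per-scale score maps with pixel-wise softmax attention weights and attaches an auxiliary loss to each scale. This is a minimal PyTorch sketch, not the authors' implementation; the module names, the small two-layer attention head, and the auxiliary loss weight are assumptions made here for readability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleAttentionMerge(nn.Module):
    """Merge per-scale score maps with pixel-wise soft attention weights.

    Minimal sketch: the attention head (a small conv net fed with the
    concatenated score maps) is an assumption; only the softmax-over-scales
    weighting follows the paper's description.
    """
    def __init__(self, num_classes, num_scales):
        super().__init__()
        # Small attention head producing one weight map per scale.
        self.attention = nn.Sequential(
            nn.Conv2d(num_classes * num_scales, 512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, num_scales, kernel_size=1),
        )

    def forward(self, score_maps):
        # score_maps: list of (N, C, H, W) tensors, one per scale,
        # already resized to a common spatial resolution.
        stacked = torch.stack(score_maps, dim=1)             # (N, S, C, H, W)
        weights = self.attention(torch.cat(score_maps, 1))   # (N, S, H, W)
        weights = F.softmax(weights, dim=1).unsqueeze(2)     # (N, S, 1, H, W)
        return (weights * stacked).sum(dim=1)                # (N, C, H, W)


def segmentation_loss(merged, per_scale_scores, target, aux_weight=1.0):
    """Cross-entropy on the merged prediction plus extra supervision at
    each scale; the value of aux_weight is an assumption."""
    loss = F.cross_entropy(merged, target, ignore_index=255)
    for scores in per_scale_scores:
        loss = loss + aux_weight * F.cross_entropy(scores, target,
                                                   ignore_index=255)
    return loss
```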
The paper also reviews related work on deep networks, multi-scale features for semantic segmentation, and attention models. The proposed model builds on the DeepLab architecture, modified into a share-net in which the same network weights are applied to each scaled input. The experiments support two conclusions: multi-scale inputs yield better performance than a single-scale input, and merging multi-scale features with the proposed attention model both improves over average- and max-pooling baselines and produces interpretable, per-pixel weight maps for the different scales.
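To make the share-net idea concrete, the following sketch runs a single shared backbone over several resized copies of the input and upsamples each score map to a common resolution before the attention merge sketched earlier. The backbone call, the particular scale factors, and bilinear resizing are illustrative assumptions; the weight sharing across scales is the point.

```python
import torch.nn.functional as F

def multi_scale_forward(backbone, merge, image, scales=(1.0, 0.75, 0.5)):
    """Share-net forward pass: the *same* backbone processes every scale.

    `backbone` maps an image (N, 3, H, W) to class score maps; `merge` is a
    ScaleAttentionMerge-style module from the sketch above. Scale values and
    resizing mode are assumptions, not the paper's exact settings.
    """
    n, _, h, w = image.shape
    per_scale_scores = []
    for s in scales:
        resized = F.interpolate(image, scale_factor=s, mode='bilinear',
                                align_corners=False)
        scores = backbone(resized)                       # shared weights
        scores = F.interpolate(scores, size=(h, w), mode='bilinear',
                               align_corners=False)      # common resolution
        per_scale_scores.append(scores)
    merged = merge(per_scale_scores)
    return merged, per_scale_scores
```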