Multi-Scale Representations by Varying Window Attention for Semantic Segmentation

2024 | Haotian Yan, Ming Wu*, Chuang Zhang
This paper proposes Varying Window Attention (VWA), a multi-scale learning approach that targets two problems in semantic segmentation. By visualizing the effective receptive fields (ERFs) of existing multi-scale representations, the authors identify scale inadequacy, where representations at certain scales are missing, and field inactivation, where parts of the receptive field are never activated.

VWA addresses both problems by disentangling local window attention (LWA) into a query window and a context window: the query window stays fixed while the context window varies in size, so each attention branch learns a representation at a different scale. Enlarging the context window, however, increases the computational cost substantially. The authors therefore propose a re-scaling strategy built on a pre-scaling principle: the enlarged context window is scaled back down before attention, eliminating the extra cost without compromising performance. Two supporting techniques, densely overlapping patch embedding (DOPE) and copy-shift padding (CSP), resolve the remaining computation and memory issues. A minimal sketch of the core mechanism follows.
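To make the mechanism concrete, here is a minimal single-head PyTorch sketch of a VWA layer. It is an illustration under stated assumptions, not the authors' implementation: the class name and parameters are invented, replicate padding stands in for the paper's copy-shift padding, and average pooling stands in for the paper's pre-scaling operator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VaryingWindowAttention(nn.Module):
    """Sketch: each R x R query window attends to a context window `ratio`
    times larger, pre-scaled back to R x R so the attention cost matches
    plain local window attention."""

    def __init__(self, dim, window=8, ratio=2):
        super().__init__()
        assert (ratio - 1) * window % 2 == 0, "padding must be an integer"
        self.window, self.ratio, self.scale = window, ratio, dim ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (B, C, H, W); H and W divisible by window
        B, C, H, W = x.shape
        R, r = self.window, self.ratio
        pad = (r - 1) * R // 2
        # replicate padding stands in for the paper's copy-shift padding (CSP);
        # assumes pad < H and pad < W
        ctx = F.pad(x, (pad, pad, pad, pad), mode='replicate')
        # carve out an (r*R x r*R) context window around every query window ...
        ctx = F.unfold(ctx, kernel_size=r * R, stride=R)      # (B, C*(rR)^2, nWin)
        n = ctx.shape[-1]                                     # nWin = (H/R)*(W/R)
        ctx = ctx.transpose(1, 2).reshape(B * n, C, r * R, r * R)
        # ... and pre-scale it back to R x R before attention
        ctx = F.adaptive_avg_pool2d(ctx, R).flatten(2).transpose(1, 2)  # (B*n, R^2, C)
        # non-overlapping R x R query windows
        q = F.unfold(x, kernel_size=R, stride=R)              # (B, C*R^2, nWin)
        q = q.transpose(1, 2).reshape(B * n, C, R * R).transpose(1, 2)
        q = self.q(q) * self.scale
        k, v = self.kv(ctx).chunk(2, dim=-1)
        attn = (q @ k.transpose(-2, -1)).softmax(dim=-1)      # (B*n, R^2, R^2)
        out = self.proj(attn @ v)                             # (B*n, R^2, C)
        out = out.transpose(1, 2).reshape(B, n, C * R * R).transpose(1, 2)
        return F.fold(out, (H, W), kernel_size=R, stride=R)   # (B, C, H, W)
```

For example, `VaryingWindowAttention(256, window=8, ratio=4)` applied to a `(1, 256, 64, 64)` feature map returns a tensor of the same shape, with each 8x8 query window having attended to a pre-scaled 32x32 neighborhood. Because the context is pooled back to the query-window size, the attention matrix stays R^2 x R^2 regardless of the ratio, which is the point of the pre-scaling principle.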
On top of VWA, the authors build VWFormer, a multi-scale decoder (MSD) that combines VWA branches with MLPs. VWFormer matches the efficiency of existing MSDs such as the FPN and the MLP decoder while performing markedly better. On ADE20K, it outperforms UPerNet by 1.0-2.5% mIoU using about half of UPerNet's computation, and it improves Mask2Former by 1.0-1.3% mIoU when the two are combined. Evaluations on Cityscapes, ADE20K, and COCO-Stuff-164K show that VWFormer consistently surpasses existing methods in both accuracy and efficiency, demonstrating that it strengthens multi-scale representations while remaining computationally economical.
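Finally, to make the decoder-level design concrete, here is an illustrative sketch of how such an MSD might fuse backbone features with parallel VWA branches, reusing the `VaryingWindowAttention` class from the sketch above. The head structure, the context ratios (2, 4, 8), and all names are assumptions for exposition, not the released VWFormer implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VWFormerHead(nn.Module):
    """Decoder sketch: per-stage MLP projections, upsampling to the finest
    stage, then parallel VWA branches with growing context ratios."""

    def __init__(self, in_dims=(64, 128, 320, 512), dim=256,
                 num_classes=150, ratios=(2, 4, 8)):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(c, dim) for c in in_dims)
        self.vwa = nn.ModuleList(
            VaryingWindowAttention(dim, window=8, ratio=r) for r in ratios)
        self.fuse = nn.Linear(dim * (len(ratios) + 1), dim)
        self.cls = nn.Conv2d(dim, num_classes, kernel_size=1)

    def forward(self, feats):                   # list of (B, Ci, Hi, Wi)
        B, _, H, W = feats[0].shape             # finest stage; H, W % 8 == 0
        fused = 0
        for f, proj in zip(feats, self.proj):
            b, c, h, w = f.shape
            f = proj(f.flatten(2).transpose(1, 2))          # pointwise MLP
            f = f.transpose(1, 2).reshape(b, -1, h, w)
            fused = fused + F.interpolate(f, size=(H, W),
                                          mode='bilinear', align_corners=False)
        # one representation per context scale, plus the fused base
        scales = [fused] + [vwa(fused) for vwa in self.vwa]
        x = torch.cat(scales, dim=1).flatten(2).transpose(1, 2)
        x = self.fuse(x).transpose(1, 2).reshape(B, -1, H, W)
        return self.cls(x)                      # (B, num_classes, H, W)
```

The design choice illustrated here is that each VWA branch contributes a representation at a different scale, and a lightweight MLP mixes them per pixel, which is how a decoder of this kind can stay as cheap as an FPN or MLP decoder while seeing much larger contexts.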