The paper "Multi-Scale Representations by Varying Window Attention for Semantic Segmentation" by Haotian Yan, Ming Wu, and Chuang Zhang addresses the challenges of multi-scale learning in semantic segmentation, particularly the issues of *scale inadequacy* and *field inactivation*. The authors propose a novel method called *varying window attention* (VWA), which leverages local window attention (LWA) and disentangles it into a query window and a context window. By varying the scale of the context window, VWA allows the query to learn representations at multiple scales while preserving the efficiency of LWA. To overcome the computational overhead introduced by varying the context window, the authors introduce a pre-scaling strategy, densely overlapping patch embedding (DOPE), and a copy-shift padding mode (CSP) to eliminate the extra cost without compromising performance.
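The query/context disentanglement can be illustrated with a minimal single-head sketch: a small query window cross-attends to a context window enlarged by a varying ratio around it. This is an illustrative simplification, not the paper's implementation; the function name, the window-clipping scheme, and the omission of projections, multi-head logic, DOPE pre-scaling, and CSP padding are all my own assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def varying_window_attention(feat, q_start, q_size, ratio):
    """Sketch of VWA's core idea: a q_size x q_size query window
    attends to a context window whose side is ratio * q_size.
    feat: (H, W, C) feature map. Linear projections and heads omitted."""
    H, W, C = feat.shape
    y, x = q_start
    q = feat[y:y + q_size, x:x + q_size].reshape(-1, C)       # query tokens

    # Context window: roughly centered on the query window, clipped to the map
    # (the paper instead uses copy-shift padding, CSP, at borders).
    c_size = ratio * q_size
    cy = max(0, min(H - c_size, y - (c_size - q_size) // 2))
    cx = max(0, min(W - c_size, x - (c_size - q_size) // 2))
    kv = feat[cy:cy + c_size, cx:cx + c_size].reshape(-1, C)  # key/value tokens

    attn = softmax(q @ kv.T / np.sqrt(C))                     # (n_q, n_ctx)
    return (attn @ kv).reshape(q_size, q_size, C)

rng = np.random.default_rng(0)
feat = rng.standard_normal((16, 16, 8))
# ratio=1 reduces to plain local window attention; larger ratios widen context.
out = varying_window_attention(feat, (4, 4), q_size=4, ratio=2)
print(out.shape)  # (4, 4, 8)
```

Because the query stays a fixed-size window while only the key/value context grows, attention cost scales with the context size rather than the full image, which is the efficiency property the pre-scaling strategy then recovers.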
Building on VWA, the authors propose a multi-scale decoder (MSD) called *VWFormer*, which incorporates various MLPs to improve multi-scale representations for semantic segmentation. VWFormer is evaluated on multiple datasets, including ADE20K, Cityscapes, and COCOStuff-164k, and is shown to outperform existing methods in both performance and efficiency. Specifically, VWFormer achieves a 1.0%–2.5% mIoU improvement over UPerNet with half the computation, and a 1.0%–1.3% improvement over Mask2Former with minimal overhead.
The paper also includes detailed analyses of existing multi-scale learning paradigms, visualizing their effective receptive fields (ERFs) and using these visualizations to identify the issues of scale inadequacy and field inactivation. The authors provide a comprehensive evaluation of VWFormer, including ablation studies and comparisons with state-of-the-art methods, demonstrating its effectiveness and efficiency in improving multi-scale representations for semantic segmentation.