The paper introduces a novel weakly supervised semantic segmentation (WSSS) method named *Adaptive Patch Contrast* (APC), which leverages Vision Transformers (ViT) to enhance patch embedding learning. APC addresses the limitations of existing ViT-based methods, such as the dominance of abnormal patches and the inefficiency of multi-stage training. The key contributions of APC include:
1. **Adaptive-K Pooling (AKP)**: This module replaces the traditional max pooling with an adaptive-K pooling layer, selecting the top-K patches based on their prediction scores to improve the robustness of patch-level predictions.
2. **Patch Contrastive Learning (PCL)**: This module enhances the intra-class compactness and inter-class separability of patch embeddings by calculating pairwise cosine similarities, thereby improving the accuracy of pseudo-labels.
3. **End-to-End Single-Stage Training**: APC replaces the multi-stage training pipeline with a single end-to-end stage, improving training efficiency.
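The first two contributions can be illustrated with a minimal sketch. It is not the paper's implementation: the summary does not say how K is adapted or give the exact contrastive loss, so this version takes `k` as a plain argument, averages the top-k scores, and uses a simple pairwise cosine penalty. The function names `top_k_pool`, `cosine`, and `patch_contrast_loss` are hypothetical.

```python
import math


def top_k_pool(patch_scores, k):
    """Average the k highest patch scores for one class.

    With k=1 this reduces to ordinary max pooling. How APC adapts k is
    not described in the summary, so k is simply passed in (assumption).
    """
    if not 1 <= k <= len(patch_scores):
        raise ValueError("k must be between 1 and the number of patches")
    top = sorted(patch_scores, reverse=True)[:k]
    return sum(top) / k


def cosine(u, v):
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def patch_contrast_loss(embeddings, labels):
    """Illustrative pairwise loss over patch embeddings (assumed form).

    Pairs with the same pseudo-label are penalized by (1 - cos), which
    encourages intra-class compactness. Pairs with different labels are
    penalized by max(0, cos), which encourages inter-class separability.
    The result is the mean over all unordered pairs.
    """
    total, pairs = 0.0, 0
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            sim = cosine(embeddings[i], embeddings[j])
            if labels[i] == labels[j]:
                total += 1.0 - sim
            else:
                total += max(0.0, sim)
            pairs += 1
    return total / pairs if pairs else 0.0


# Patch scores for one class: a single outlier patch (0.95) dominates max pooling.
scores = [0.1, 0.95, 0.2, 0.6, 0.7]
max_pooled = top_k_pool(scores, 1)
top3_pooled = top_k_pool(scores, 3)

# Three 2-D patch embeddings with pseudo-labels 0, 0, 1.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
labels = [0, 0, 1]
loss = patch_contrast_loss(embeddings, labels)
```

Averaging over several high-scoring patches means one abnormal patch no longer decides the image-level prediction by itself, which is the robustness argument made for AKP. In the example, the first two embeddings share a label but are orthogonal, and the first and third have different labels but are identical, so both pairs contribute to the loss.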
Experimental results on the PASCAL VOC 2012 and MS COCO 2014 datasets demonstrate that APC outperforms other state-of-the-art WSSS methods, achieving superior segmentation results with shorter training durations. The paper also includes ablation studies to validate the effectiveness of the proposed components and visualizations to illustrate the improved segmentation performance.