11 Jul 2024 | Tong Shao, Zhuotao Tian, Hang Zhao, and Jingyong Su
The paper "Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation" by Tong Shao, Zhuotao Tian, Hang Zhao, and Jingyong Su of Harbin Institute of Technology, Shenzhen, China, examines why CLIP (Contrastive Language-Image Pre-training) struggles with open-vocabulary semantic segmentation (OVSS) and proposes CLIPtrase, a novel training-free approach that improves its performance.

Although CLIP achieves impressive zero-shot generalization, its image-level alignment training leaves it weak at discriminating detailed local context. The study reveals that the [CLS] token in CLIP dominates a set of "global" patches, which hinders local feature discrimination. CLIPtrase addresses this by recalibrating the self-correlation among patches, enhancing local feature awareness and thereby improving segmentation accuracy and semantic coherence. The method consists of three core components: Semantic Correlation Recovery, Patch Clustering, and Denoising, which collectively restore semantic correlations among patches and refine object boundaries.

In experiments across 9 segmentation benchmarks, CLIPtrase outperforms CLIP and existing state-of-the-art training-free methods by 22.3% on average. The paper also discusses integrating CLIPtrase with other models, such as SAM, to further improve segmentation performance.
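To make the core idea more concrete, below is a minimal PyTorch sketch of how a recalibrated patch self-correlation could be used to refine CLIP patch features before matching them against text embeddings. The thresholding, softmax temperature, tensor shapes, and function names are illustrative assumptions for this sketch, not the paper's exact formulation, which additionally includes the Patch Clustering and Denoising steps.

```python
# Minimal sketch: recalibrated patch self-correlation for training-free
# segmentation with CLIP-style features. Shapes, thresholds, and helper
# names are assumptions made for illustration, not the paper's exact method.
import torch
import torch.nn.functional as F

def recalibrated_self_correlation(patch_feats: torch.Tensor,
                                  tau: float = 0.1) -> torch.Tensor:
    """patch_feats: [N, D] patch embeddings from the visual encoder.
    Returns an [N, N] correlation matrix with negative affinities suppressed,
    so each patch attends to semantically similar patches rather than to a
    few dominating 'global' patches."""
    x = F.normalize(patch_feats, dim=-1)        # cosine-normalize patches
    corr = x @ x.t()                            # raw pairwise similarity
    corr = corr.clamp(min=0.0)                  # keep positive affinities only (assumption)
    corr = torch.softmax(corr / tau, dim=-1)    # row-normalize into attention weights
    return corr

def segment(patch_feats: torch.Tensor,
            text_embeds: torch.Tensor,
            grid: int) -> torch.Tensor:
    """text_embeds: [C, D] class-name embeddings from the text encoder.
    Returns a [grid, grid] label map over the patch grid."""
    corr = recalibrated_self_correlation(patch_feats)
    refined = corr @ patch_feats                # aggregate features over correlated patches
    refined = F.normalize(refined, dim=-1)
    logits = refined @ F.normalize(text_embeds, dim=-1).t()  # [N, C] patch-text scores
    return logits.argmax(dim=-1).reshape(grid, grid)

if __name__ == "__main__":
    # Toy example: a 14x14 patch grid, 512-dim features, 5 candidate classes.
    feats = torch.randn(14 * 14, 512)
    texts = torch.randn(5, 512)
    print(segment(feats, texts, grid=14).shape)  # torch.Size([14, 14])
```

In this toy version, the refined patch features are scored directly against text embeddings per patch; in the paper, the recalibrated correlations further drive clustering and denoising before the final dense prediction.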