ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference

17 Jul 2024 | Mengcheng Lan, Chaofeng Chen, Yiping Ke, Xinjiang Wang, Litong Feng*, and Wayne Zhang
The paper "ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference" addresses the challenge of applying large-scale Vision-Language Models (VLMs), particularly CLIP, to semantic segmentation tasks. The authors identify residual connections as the primary source of noise that degrades segmentation quality. They propose ClearCLIP, a novel approach that decomposes CLIP's representations to enhance open-vocabulary semantic segmentation. ClearCLIP involves three modifications: removing the residual connection, implementing self-self attention, and discarding the feed-forward network (FFN). The method consistently generates clearer and more accurate segmentation maps, outperforming existing approaches across multiple benchmarks. The paper includes a thorough analysis of feature statistics, ablation studies, and qualitative comparisons, demonstrating the effectiveness of ClearCLIP in improving segmentation performance.The paper "ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference" addresses the challenge of applying large-scale Vision-Language Models (VLMs), particularly CLIP, to semantic segmentation tasks. The authors identify residual connections as the primary source of noise that degrades segmentation quality. They propose ClearCLIP, a novel approach that decomposes CLIP's representations to enhance open-vocabulary semantic segmentation. ClearCLIP involves three modifications: removing the residual connection, implementing self-self attention, and discarding the feed-forward network (FFN). The method consistently generates clearer and more accurate segmentation maps, outperforming existing approaches across multiple benchmarks. The paper includes a thorough analysis of feature statistics, ablation studies, and qualitative comparisons, demonstrating the effectiveness of ClearCLIP in improving segmentation performance.