ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference

17 Jul 2024 | Mengcheng Lan, Chaofeng Chen, Yiping Ke, Xinjiang Wang, Litong Feng*, and Wayne Zhang
This paper investigates the challenges of applying CLIP to semantic segmentation tasks, where CLIP's image-text contrastive training emphasizes global features at the expense of local discriminability, leading to noisy segmentation maps. The authors propose ClearCLIP, a novel approach that decomposes CLIP's representations to enhance open-vocabulary semantic segmentation. ClearCLIP introduces three modifications to the final layer of CLIP's visual encoder: removing the residual connection, implementing self-self attention, and discarding the feed-forward network. These modifications result in clearer and more accurate segmentation maps, outperforming existing approaches on multiple benchmarks.

The study reveals that the residual connection in CLIP's final layer significantly affects segmentation quality: it introduces noise by emphasizing global features over local discriminability. By removing the residual connection and adopting self-self attention, ClearCLIP achieves superior performance in open-vocabulary semantic segmentation; the self-self attention mechanism helps separate dissimilar spatial features, improving segmentation accuracy. The feed-forward network is found to have a negligible effect on the image representation during inference, and its removal further enhances performance when combined with the removal of the residual connection.

ClearCLIP is evaluated on eight benchmark datasets and demonstrates significant performance improvements, achieving the best results on four out of five datasets. It outperforms existing methods such as TCL, SCLIP, and MaskCLIP, particularly when using larger models. These results show that decomposing CLIP's representations leads to more accurate and reliable segmentation maps, making ClearCLIP a promising solution for dense vision-language inference tasks.
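The three architectural changes are simple enough to express directly. The sketch below is not the authors' released implementation: it assumes batch-first (B, N, C) tokens and a torch.nn.MultiheadAttention-style block (the attribute names in_proj_weight, in_proj_bias, out_proj, num_heads, and the LayerNorm ln1 are assumptions), and it uses query-query attention as one concrete instance of self-self attention.

```python
import torch
import torch.nn.functional as F


def clearclip_final_block(x, attn, ln1):
    """Sketch of ClearCLIP's modified final visual transformer block.

    x    : (B, N, C) patch tokens entering CLIP's last block.
    attn : the block's pretrained attention module (assumed to expose
           torch.nn.MultiheadAttention-style packed q/k/v projections).
    ln1  : the block's pre-attention LayerNorm.

    The three modifications described in the paper:
      1. self-self attention (here query-query) instead of query-key attention,
      2. no residual connection around the attention output,
      3. the feed-forward network (and its residual) is dropped entirely.
    """
    y = ln1(x)

    # Reuse the pretrained packed in-projection to obtain q, k, v.
    qkv = F.linear(y, attn.in_proj_weight, attn.in_proj_bias)
    q, _, v = qkv.chunk(3, dim=-1)  # keys are unused under query-query attention

    B, N, C = q.shape
    num_heads = attn.num_heads
    head_dim = C // num_heads

    def split_heads(t):
        # (B, N, C) -> (B, heads, N, head_dim)
        return t.view(B, N, num_heads, head_dim).transpose(1, 2)

    q, v = split_heads(q), split_heads(v)

    # (1) Self-self attention: queries attend to queries, so each location
    # aggregates from spatially similar locations rather than global context.
    weights = torch.softmax((q @ q.transpose(-2, -1)) * head_dim ** -0.5, dim=-1)

    out = (weights @ v).transpose(1, 2).reshape(B, N, C)
    out = attn.out_proj(out)

    # (2) No residual connection (no `x + out`) and (3) no feed-forward
    # network: the attention output is used directly as the dense feature.
    return out
```

For dense inference, each patch feature is projected into CLIP's shared embedding space and compared against the text embeddings of the candidate class names. The snippet below is a generic illustration of that matching step, with dense_feats and text_feats assumed to be L2-normalized.

```python
# dense_feats: (B, N, D) projected, L2-normalized patch features (assumed)
# text_feats : (K, D)    L2-normalized text embeddings of K class prompts (assumed)
logits = dense_feats @ text_feats.t()   # (B, N, K) cosine similarities
seg = logits.argmax(dim=-1)             # (B, N) per-patch class indices,
                                        # reshaped to H x W and upsampled
                                        # to form the segmentation map
```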