Open-Vocabulary Semantic Segmentation with Image Embedding Balancing

14 Jun 2024 | Xiangheng Shan, Dongyue Wu, Guilin Zhu, Yuanjie Shao*, Nong Sang, Changxin Gao
This paper proposes a novel framework for open-vocabulary semantic segmentation called EBSeg, which incorporates an Adaptively Balanced Decoder (AdaB Decoder) and a Semantic Structure Consistency loss (SSC Loss). The AdaB Decoder generates different image embeddings for both training and new classes, adaptively balancing them to fully exploit their ability to recognize training classes and generalize to new classes. The SSC Loss aligns the inter-class affinity in the image feature space with that in the text feature space of CLIP, improving the generalization ability of the model. Additionally, a frozen SAM image encoder is used to complement the spatial information of CLIP features. Extensive experiments on various benchmarks demonstrate that EBSeg outperforms state-of-the-art methods, achieving significant improvements in mIoU metrics. The method effectively addresses the challenge of overfitting to training classes by leveraging the strengths of CLIP and SAM, and it establishes a new state-of-the-art in open-vocabulary semantic segmentation.
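To make the SSC Loss idea concrete, the following is a minimal pure-Python sketch of the underlying principle: compute the inter-class cosine-similarity (affinity) matrix among per-class image embeddings and among the corresponding CLIP text embeddings, and penalize the discrepancy between the two matrices. The function names, the use of cosine similarity, and the mean-absolute-difference penalty are illustrative assumptions; the paper's exact formulation (normalization, distance measure, which embeddings enter the loss) may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def affinity_matrix(embeddings):
    """Pairwise inter-class cosine-similarity matrix (n x n)."""
    return [[cosine(u, v) for v in embeddings] for u in embeddings]


def ssc_loss(image_embeddings, text_embeddings):
    """Illustrative SSC-style loss: mean absolute difference between the
    inter-class affinity matrix of image embeddings and that of the
    corresponding text embeddings. A smaller value means the semantic
    structure of the image feature space better matches the text space."""
    a_img = affinity_matrix(image_embeddings)
    a_txt = affinity_matrix(text_embeddings)
    n = len(a_img)
    total = sum(abs(a_img[i][j] - a_txt[i][j])
                for i in range(n) for j in range(n))
    return total / (n * n)
```

Minimizing such a loss during training pushes the relative similarities between class embeddings on the image side to mirror those in CLIP's text space, which is one way to transfer CLIP's semantic structure to the segmentation model and curb overfitting to the training classes.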