Open-Vocabulary Semantic Segmentation with Image Embedding Balancing

14 Jun 2024 | Xiangheng Shan, Dongyue Wu, Guilin Zhu, Yuanjie Shao*, Nong Sang, Changxin Gao
This paper proposes a novel framework for open-vocabulary semantic segmentation called EBSeg, which incorporates an Adaptively Balanced Decoder (AdaB Decoder) and a Semantic Structure Consistency loss (SSC Loss). The AdaB Decoder generates different image embeddings for both training and new classes, adaptively balancing them to fully exploit their ability to recognize training classes and generalize to new classes. The SSC Loss aligns the inter-class affinity in the image feature space with that in the text feature space of CLIP, improving the generalization ability of the model. Additionally, a frozen SAM image encoder is used to complement the spatial information of CLIP features. Extensive experiments on various benchmarks demonstrate that EBSeg outperforms state-of-the-art methods, achieving significant improvements in mIoU metrics. The method effectively addresses the challenge of overfitting to training classes by leveraging the strengths of CLIP and SAM, and it establishes a new state-of-the-art in open-vocabulary semantic segmentation.
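To make the SSC Loss idea concrete, the following is a minimal pure-Python sketch of the underlying principle: compute the inter-class cosine-similarity (affinity) matrix among per-class image embeddings and among the corresponding CLIP text embeddings, and penalize the discrepancy between the two matrices. The function names, the use of cosine similarity, and the mean-absolute-difference penalty are illustrative assumptions; the paper's exact formulation (normalization, distance measure, which embeddings enter the loss) may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def affinity_matrix(embeddings):
    """Pairwise inter-class cosine-similarity matrix (n x n)."""
    return [[cosine(u, v) for v in embeddings] for u in embeddings]


def ssc_loss(image_embeddings, text_embeddings):
    """Illustrative SSC-style loss: mean absolute difference between the
    inter-class affinity matrix of image embeddings and that of the
    corresponding text embeddings. A smaller value means the semantic
    structure of the image feature space better matches the text space."""
    a_img = affinity_matrix(image_embeddings)
    a_txt = affinity_matrix(text_embeddings)
    n = len(a_img)
    total = sum(abs(a_img[i][j] - a_txt[i][j])
                for i in range(n) for j in range(n))
    return total / (n * n)
```

Minimizing such a loss during training pushes the relative similarities between class embeddings on the image side to mirror those in CLIP's text space, which is one way to transfer CLIP's semantic structure to the segmentation model and curb overfitting to the training classes.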