5 Jan 2024 | Haobo Yuan, Xiangtai Li, Chong Zhou, Yining Li, Kai Chen, Chen Change Loy
The paper introduces Open-Vocabulary SAM, a unified framework that integrates the Segment Anything Model (SAM) and CLIP to enable interactive segmentation and recognition of over 22,000 classes. SAM excels at segmentation, while CLIP is renowned for zero-shot recognition. The proposed model addresses the limitations of both via two knowledge transfer modules: SAM2CLIP and CLIP2SAM. SAM2CLIP adapts SAM's knowledge into CLIP through distillation and learnable transformer adapters, while CLIP2SAM transfers CLIP's knowledge into SAM, enhancing its recognition capabilities. Extensive experiments across various datasets and detectors show that Open-Vocabulary SAM outperforms naive combinations of SAM and CLIP, achieving significant improvements in both segmentation and recognition. The model is flexible and can be integrated with various detectors, making it suitable for both closed-set and open-set settings.
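To make the SAM2CLIP idea concrete, here is a minimal sketch of a learnable transformer adapter that maps frozen-SAM encoder features into CLIP's embedding space and is trained with a distillation loss against frozen CLIP image features. This is not the authors' implementation: the module name, dimensions, and the cosine-distance objective are assumptions for illustration only.

```python
# Hypothetical sketch of a SAM2CLIP-style adapter, assuming a frozen SAM
# image encoder and a frozen CLIP visual encoder; only the adapter trains.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SAM2CLIPAdapter(nn.Module):
    """Learnable transformer adapter: SAM feature space -> CLIP feature space."""
    def __init__(self, sam_dim=256, clip_dim=768, num_layers=2, num_heads=8):
        super().__init__()
        self.proj_in = nn.Linear(sam_dim, clip_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=clip_dim, nhead=num_heads, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, sam_tokens):
        # sam_tokens: (B, N, sam_dim) tokens from the frozen SAM image encoder
        return self.transformer(self.proj_in(sam_tokens))

def distillation_loss(adapted_sam, clip_feats):
    # Align adapted SAM tokens with frozen CLIP image features.
    # Cosine distance is an assumption; the paper's exact loss may differ.
    a = F.normalize(adapted_sam, dim=-1)
    c = F.normalize(clip_feats, dim=-1)
    return (1 - (a * c).sum(dim=-1)).mean()

# Usage with placeholder tensors standing in for encoder outputs:
adapter = SAM2CLIPAdapter()
sam_tokens = torch.randn(2, 196, 256)   # placeholder SAM encoder tokens
clip_feats = torch.randn(2, 196, 768)   # placeholder CLIP visual tokens
loss = distillation_loss(adapter(sam_tokens), clip_feats)
loss.backward()
```

Under this reading, distillation lets a single SAM-derived backbone serve both tasks: the adapter supplies CLIP-aligned features for recognition, while SAM's original decoder path (augmented by CLIP2SAM) retains its segmentation strength.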