5 Jan 2024 | Haobo Yuan, Xiangtai Li, Chong Zhou, Yining Li, Kai Chen, Chen Change Loy
Open-Vocabulary SAM extends the capabilities of the Segment Anything Model (SAM) by integrating it with CLIP, enabling interactive segmentation and recognition of up to 22,000 classes. The model introduces two knowledge transfer modules: SAM2CLIP and CLIP2SAM. SAM2CLIP distills knowledge from SAM into CLIP via a lightweight transformer adapter, while CLIP2SAM transfers CLIP knowledge into SAM to enhance its recognition capabilities. This approach significantly outperforms naive combinations of SAM and CLIP on tasks such as object recognition on the COCO benchmark. The model is trained on diverse datasets and can be applied in both closed-set and open-set settings, leveraging semantic datasets such as COCO, LVIS, and ImageNet-22k to improve recognition and segmentation. Its performance is validated across various datasets and scenarios, showing over 2% improvement in IoU and 3% in mAP on COCO, and over 20% improvement in recognition on LVIS compared with previous adapter-based methods. The model is flexible and can be integrated with various detectors, making it suitable for practical applications. The study highlights the effectiveness of the unified encoder-decoder framework and the importance of knowledge transfer between SAM and CLIP for enhanced segmentation and recognition.
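To make the two knowledge-transfer modules concrete, here is a minimal PyTorch sketch of the idea: a SAM2CLIP-style adapter trained to align CLIP features with SAM's image features via distillation, and a CLIP2SAM-style head that combines CLIP semantics with per-mask queries to score class names. All module names, dimensions, and the stand-in random tensors are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch (not the official code): SAM2CLIP distills SAM encoder
# features into an adapter on top of CLIP features; CLIP2SAM fuses CLIP
# semantics with mask queries so each predicted mask also gets a class label.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SAM2CLIPAdapter(nn.Module):
    """Lightweight transformer adapter: maps CLIP features toward SAM's feature space."""

    def __init__(self, clip_dim=768, sam_dim=256, num_layers=2, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(clip_dim, sam_dim)
        layer = nn.TransformerEncoderLayer(d_model=sam_dim, nhead=num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, clip_tokens):                 # (B, N, clip_dim) CLIP patch tokens
        return self.blocks(self.proj(clip_tokens))  # (B, N, sam_dim) aligned to SAM features


class CLIP2SAMHead(nn.Module):
    """Fuses CLIP semantics with per-mask queries to produce open-vocabulary class logits."""

    def __init__(self, sam_dim=256, clip_dim=768, embed_dim=512):
        super().__init__()
        self.query_proj = nn.Linear(sam_dim, embed_dim)
        self.clip_proj = nn.Linear(clip_dim, embed_dim)

    def forward(self, mask_queries, clip_global, text_embeds):
        # mask_queries: (B, Q, sam_dim) per-mask tokens from a SAM-style decoder
        # clip_global:  (B, clip_dim)   pooled CLIP image feature
        # text_embeds:  (C, embed_dim)  CLIP text embeddings for C class names
        q = self.query_proj(mask_queries) + self.clip_proj(clip_global).unsqueeze(1)
        q = F.normalize(q, dim=-1)
        t = F.normalize(text_embeds, dim=-1)
        return q @ t.t()                             # (B, Q, C) classification logits


if __name__ == "__main__":
    B, N, Q, C = 2, 196, 5, 22000                    # batch, patch tokens, mask queries, classes
    clip_tokens = torch.randn(B, N, 768)             # stand-in for frozen CLIP patch features
    sam_tokens = torch.randn(B, N, 256)              # stand-in for frozen SAM encoder features

    adapter = SAM2CLIPAdapter()
    distill_loss = F.mse_loss(adapter(clip_tokens), sam_tokens)  # SAM -> CLIP transfer objective

    head = CLIP2SAMHead()
    logits = head(torch.randn(B, Q, 256), torch.randn(B, 768), torch.randn(C, 512))
    print(distill_loss.item(), logits.shape)         # torch.Size([2, 5, 22000])
```

The sketch reflects the unified design described above: a single shared backbone feeds both branches, the distillation loss stands in for SAM-to-CLIP knowledge transfer, and the cosine-similarity scoring against CLIP text embeddings is what allows the class vocabulary to scale to tens of thousands of names.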