VRP-SAM: SAM with Visual Reference Prompt

30 Mar 2024 | Yanpeng Sun, Jiahui Chen, Shan Zhang, Xinyu Zhang, Qiang Chen, Gang Zhang, Errui Ding, Jingdong Wang, Zechao Li
VRP-SAM is a novel approach that enhances the Segment Anything Model (SAM) by integrating a Visual Reference Prompt (VRP) encoder. The encoder lets SAM take an annotated reference image as its prompt, so that the corresponding object can be segmented in a target image. It supports several annotation formats, including points, boxes, scribbles, and masks, which extends SAM's versatility while preserving its inherent strengths, and a meta-learning strategy is used to improve the generalization ability of VRP-SAM.

Extensive experiments show that VRP-SAM achieves state-of-the-art performance in visual reference segmentation with a minimal number of learnable parameters, outperforming existing few-shot segmentation methods in mean intersection over union (mIoU) on the Pascal-5^i and COCO-20^i benchmarks. It also generalizes well, segmenting unseen object categories and handling cross-domain settings such as domain shift and diverse image styles, and it transfers to part segmentation and video object segmentation tasks.

Ablation studies validate these results, showing that the number of learnable queries and the type of reference annotation significantly affect segmentation quality. Overall, VRP-SAM is a flexible and robust solution for visual reference segmentation, offering better generalization and adaptability than the original SAM. The source code and models are available at https://github.com/syp2ysy/VRP-SAM.
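To make the pipeline concrete, below is a minimal, self-contained sketch of how a visual-reference prompt encoder of this kind could feed SAM's prompt-aware mask decoder: reference image features are fused with the reference annotation, a small set of learnable queries attends to that fused representation and then to the target-image features, and the resulting embeddings play the role of prompt tokens. The class name `VRPEncoderSketch`, the feature dimensions, and the number of queries are illustrative assumptions, not the authors' implementation; the real code is in the linked repository.

```python
# Minimal sketch (assumed structure, not the official VRP-SAM code).
import torch
import torch.nn as nn


class VRPEncoderSketch(nn.Module):
    """Toy stand-in for a Visual Reference Prompt encoder: it fuses reference
    image features with a reference annotation (here, a binary mask) and emits
    a small set of query embeddings conditioned on the target image."""

    def __init__(self, feat_dim: int = 256, num_queries: int = 50):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, feat_dim))
        self.fuse = nn.Linear(feat_dim + 1, feat_dim)  # image features + 1 mask channel
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)

    def forward(self, ref_feats, ref_mask, tgt_feats):
        # ref_feats / tgt_feats: (B, HW, C) flattened image features
        # ref_mask: (B, HW, 1) binary annotation of the reference object
        fused = self.fuse(torch.cat([ref_feats, ref_mask], dim=-1))
        q = self.queries.unsqueeze(0).expand(ref_feats.size(0), -1, -1)
        # Queries first attend to the annotated reference, then to the target.
        q, _ = self.cross_attn(q, fused, fused)
        q, _ = self.cross_attn(q, tgt_feats, tgt_feats)
        return q  # prompt embeddings to be consumed by SAM's frozen mask decoder


if __name__ == "__main__":
    B, HW, C = 2, 64 * 64, 256
    encoder = VRPEncoderSketch()
    prompts = encoder(torch.randn(B, HW, C),
                      torch.randint(0, 2, (B, HW, 1)).float(),
                      torch.randn(B, HW, C))
    print(prompts.shape)  # torch.Size([2, 50, 256])
```

In this reading, only the prompt encoder is trained while SAM's image encoder and mask decoder stay frozen, which is consistent with the paper's emphasis on a minimal number of learnable parameters; the `num_queries` hyperparameter corresponds to the query count examined in the ablation studies.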