5 Apr 2023 | Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, Ross Girshick
The paper introduces the Segment Anything (SA) project, which aims to build a foundation model for image segmentation. The project consists of three interconnected components: a promptable segmentation task, a segmentation model (SAM), and a data engine for collecting the SA-1B dataset. The promptable segmentation task is designed to enable zero-shot generalization to new image distributions and tasks. SAM is trained to be promptable, allowing it to output segmentation masks in real-time when prompted. The data engine iterates between using SAM to assist in data collection and using newly collected data to improve the model. The resulting SA-1B dataset contains 1.1 billion high-quality masks on 11 million licensed and privacy-respecting images.
Experiments demonstrate that SAM's zero-shot performance is impressive, often competitive with or even superior to prior fully supervised results. The paper also includes a Responsible AI (RAI) analysis, evaluating the dataset's geographic and income representation and SAM's fairness across protected attributes. The authors conclude by discussing the role of foundation models and compositionality in computer vision.
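The core idea of the promptable task is that a prompt (e.g. a point) selects a mask from the image. As a minimal illustrative sketch, not SAM itself: the toy `predict_mask` helper below is hypothetical and simply looks up which pre-labeled region a prompted point falls in, standing in for the model's real-time mask prediction.

```python
def predict_mask(label_map, point):
    """Toy promptable segmentation: return a binary mask for the
    region under the prompted (row, col) point. Illustrative only;
    SAM predicts masks from image features, not a label lookup."""
    r, c = point
    target = label_map[r][c]
    return [[1 if v == target else 0 for v in row] for row in label_map]

# A 4x4 "image" whose pixels belong to two regions (labels 0 and 1).
label_map = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]

# A point prompt inside region 1 selects that region's mask.
mask = predict_mask(label_map, (0, 3))
```

In the real system the same interface shape holds (image in, prompt in, mask out), which is what lets one model generalize zero-shot across downstream segmentation tasks.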