21 Mar 2024 | Zheng Zhang, Yeyao Ma, Enming Zhang, and Xiang Bai
PSALM is a powerful extension of the Large Multi-Modal Model (LMM) designed to address segmentation tasks. It incorporates a mask decoder and a flexible input schema that accommodates various segmentation tasks, taking images, task instructions, conditional prompts, and mask tokens as input; this design allows the model to generate and classify segmentation masks effectively. PSALM supports joint training across multiple datasets and tasks, leading to improved performance and task generalization: it achieves superior results on in-domain benchmarks such as RefCOCO/RefCOCO+/RefCOCOg, COCO Panoptic Segmentation, and COCO-Interactive. The flexible design, multi-task joint training, and the strong visual understanding capability of LMMs further enable PSALM to generalize to out-of-domain tasks in a zero-shot manner, with promising performance on unseen tasks such as open-vocabulary segmentation, generalized referring expression segmentation, and video object segmentation. Through extensive experiments, PSALM demonstrates strong potential to address general image segmentation tasks and exhibits a degree of task generalization similar to that of LLMs in NLP, facilitating the realization of a GPT moment in computer vision. The code and models are available at https://github.com/zamling/PSALM.
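To make the "flexible input schema" concrete, the following is a minimal conceptual sketch of how several segmentation tasks could be expressed as the same four-part input (image, task instruction, condition prompt, mask tokens). All class, function, and field names here are hypothetical illustrations, not PSALM's actual API; the real implementation lives in the linked repository.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SegmentationInput:
    """Hypothetical unified input: one schema shared by all tasks."""
    image: str                       # placeholder for image features
    instruction: str                 # natural-language task description
    condition: Optional[str] = None  # task-specific prompt (categories, expression, click)
    num_mask_tokens: int = 100       # queries consumed by the mask decoder

def build_input(task: str, image: str) -> SegmentationInput:
    """Compose task-specific inputs under the shared schema (illustrative values)."""
    if task == "panoptic":
        # Condition: candidate category names used to classify predicted masks.
        return SegmentationInput(image, "Segment all objects.", "person, car, sky")
    if task == "referring":
        # Condition: a free-form referring expression.
        return SegmentationInput(image, "Segment the referred object.",
                                 "the man in a red shirt")
    if task == "interactive":
        # Condition: a visual prompt such as a click (string placeholder here).
        return SegmentationInput(image, "Segment the prompted region.", "point(120, 85)")
    raise ValueError(f"unknown task: {task}")
```

Because every task reduces to the same input shape, a single model can be jointly trained across datasets and, at inference time, switch tasks simply by changing the instruction and condition fields.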