21 Mar 2024 | Zheng Zhang, Yeyao Ma, Enming Zhang, Xiang Bai
PSALM (Pixelwise SegmentAtion with Large Multi-Modal Model) is a novel approach that extends the capabilities of large multi-modal models (LMMs) to address pixel-level image segmentation tasks. It overcomes the limitation of LMMs, which are primarily designed for text outputs, by incorporating a mask decoder and a flexible input schema. The input schema includes images, task instructions, condition prompts, and mask tokens, enabling the model to generate and classify segmentation masks effectively. PSALM supports joint training across multiple datasets and tasks, leading to improved performance and task generalization. It achieves superior results on benchmarks such as RefCOCO/RefCOCO+/RefCOCOg, COCO Panoptic Segmentation, and COCO-Interactive, and demonstrates zero-shot capabilities on unseen tasks like open-vocabulary segmentation, generalized referring expression segmentation, and video object segmentation. The flexible design of PSALM, leveraging the robust visual understanding capabilities of LMMs, shows strong potential in transforming the domain of image segmentation. The code and models are available at <https://github.com/zamling/PSALM>.
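To make the four-part input schema concrete, here is a minimal Python sketch. It is not taken from the PSALM codebase: the class `PsalmStyleInput`, the placeholder token strings, the ordering of the parts, and the dot-product rule in `classify_masks` are all assumptions chosen for illustration. The abstract only states that the input contains images, task instructions, condition prompts, and mask tokens, and that the model generates and classifies masks.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PsalmStyleInput:
    """Hypothetical container for the four input parts named in the abstract."""
    image_tokens: List[str]   # stand-ins for visual features
    instruction: str          # task instruction, e.g. "panoptic segmentation"
    conditions: List[str]     # condition prompts, e.g. category names
    num_mask_tokens: int      # number of mask-proposal slots

    def to_sequence(self) -> List[str]:
        # Assumed ordering: image, instruction, conditions, mask tokens.
        seq = list(self.image_tokens)
        seq.append(f"<instruction>{self.instruction}</instruction>")
        seq.extend(f"<cond>{c}</cond>" for c in self.conditions)
        seq.extend(f"<mask_{i}>" for i in range(self.num_mask_tokens))
        return seq


def classify_masks(mask_embs: List[List[float]],
                   cond_embs: List[List[float]]) -> List[int]:
    """Assign each mask embedding to the condition with the highest dot product.

    This similarity rule is an illustrative assumption, not the paper's method.
    """
    labels = []
    for m in mask_embs:
        scores = [sum(a * b for a, b in zip(m, c)) for c in cond_embs]
        labels.append(max(range(len(scores)), key=scores.__getitem__))
    return labels
```

For example, `PsalmStyleInput(["<img_0>", "<img_1>"], "panoptic segmentation", ["person", "sky"], 3).to_sequence()` yields a list of eight placeholder tokens, ending with the three mask slots. Changing only the instruction and the condition prompts is how a schema like this could serve different segmentation tasks with one model.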