15 Jun 2022 | Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar
Mask2Former is a universal image segmentation architecture that outperforms specialized models on three major tasks: panoptic, instance, and semantic segmentation. It introduces masked attention in the Transformer decoder, which restricts attention to localized features, improving convergence and performance. Additionally, it uses multi-scale high-resolution features and optimization improvements such as switching the order of self- and cross-attention, making query features learnable, and removing dropout. These enhancements allow Mask2Former to achieve state-of-the-art results on four popular datasets: COCO, ADE20K, Cityscapes, and Mapillary Vistas.

On COCO, it achieves 57.8 PQ for panoptic segmentation, 50.1 AP for instance segmentation, and 57.7 mIoU for semantic segmentation. Mask2Former is also more memory-efficient, reducing training memory by 3× without affecting performance. It is easy to train and accessible to users with limited computational resources. The architecture is based on a simple meta framework with a backbone feature extractor, pixel decoder, and Transformer decoder. It demonstrates strong performance across different segmentation tasks and is a promising solution for universal image segmentation.
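The masked-attention idea can be sketched in a few lines: instead of letting each query attend to the full feature map, attention logits at pixels outside the query's current mask prediction are set to negative infinity before the softmax. The sketch below is a minimal numpy illustration, not the paper's implementation; the fallback to full attention for queries with an empty mask is an assumption made here to keep the softmax well-defined.

```python
import numpy as np

def masked_attention(Q, K, V, mask):
    """Minimal sketch of masked cross-attention.

    Q: (num_queries, d) query features
    K, V: (num_pixels, d) flattened image features
    mask: (num_queries, num_pixels) boolean foreground prediction
          from the previous decoder layer
    """
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)  # scaled dot-product scores
    # Assumption in this sketch: a query whose predicted mask is empty
    # falls back to full attention, so the softmax stays well-defined.
    empty = ~mask.any(axis=-1, keepdims=True)
    attend = mask | empty
    # Masked attention: -inf outside the foreground region, so those
    # pixels receive zero weight after the softmax.
    logits = np.where(attend, logits, -np.inf)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Toy example: query 0 may only attend to pixel 0, so its output is V[0].
Q = np.array([[1.0, 0.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
V = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
mask = np.array([[True, False, False]])
out = masked_attention(Q, K, V, mask)
```

Because the mask zeroes out background logits entirely, each query's context is focused on its own region estimate, which is what the paper credits for the faster convergence.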