Masked-attention Mask Transformer for Universal Image Segmentation

15 Jun 2022 | Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar
Masked-attention Mask Transformer (Mask2Former) is a novel architecture designed to address any image segmentation task, including panoptic, instance, and semantic segmentation. The key innovation in Mask2Former is masked attention, which restricts cross-attention to the foreground region of each query's predicted mask, enabling the model to focus on localized features. This approach not only improves convergence and performance but also reduces training memory requirements. Mask2Former outperforms specialized architectures on multiple datasets, achieving state-of-the-art results in panoptic segmentation (57.8 PQ on COCO), instance segmentation (50.1 AP on COCO), and semantic segmentation (57.7 mIoU on ADE20K). The architecture is built upon a simple meta framework and includes improvements such as multi-scale high-resolution features and optimization techniques. Mask2Former is designed to be flexible and efficient, making it accessible to users with limited computational resources. The paper also discusses the limitations of the model, particularly in handling small objects, and suggests future directions for improvement.
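To make the masked-attention idea concrete, below is a minimal PyTorch sketch of a single-head masked cross-attention step. It is an illustrative assumption, not the authors' implementation: the function name, the 0.5 binarization threshold, and the fallback for empty masks are choices made here for clarity, and the real Mask2Former decoder additionally uses multi-head attention, residual connections, learnable queries, and multi-scale features.

```python
import torch

def masked_cross_attention(queries, keys, values, mask_logits, threshold=0.5):
    """Sketch of masked attention: cross-attention restricted to the
    foreground of each query's predicted mask (from the previous layer).

    queries:     (N, C)  query embeddings
    keys/values: (HW, C) flattened image features
    mask_logits: (N, HW) per-query mask predictions
    """
    scale = queries.shape[-1] ** 0.5
    attn = queries @ keys.t() / scale                       # (N, HW) attention logits

    # Binarize the predicted masks and block attention to background
    # locations by setting their logits to -inf before the softmax.
    foreground = mask_logits.sigmoid() > threshold          # (N, HW) boolean mask
    masked_attn = attn.masked_fill(~foreground, float("-inf"))

    # If a predicted mask is empty, fall back to unrestricted attention
    # for that query to avoid an all -inf row (which would yield NaNs).
    empty = ~foreground.any(dim=-1, keepdim=True)           # (N, 1)
    masked_attn = torch.where(empty, attn, masked_attn)

    weights = masked_attn.softmax(dim=-1)                   # (N, HW)
    return weights @ values                                 # (N, C) updated queries
```

The key point the sketch captures is that each query only aggregates features from pixels its current mask prediction marks as foreground, which is what lets the decoder attend to localized regions instead of the full image.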