Per-Pixel Classification is Not All You Need for Semantic Segmentation

31 Oct 2021 | Bowen Cheng, Alexander G. Schwing, Alexander Kirillov
MaskFormer is a novel approach to semantic segmentation that uses mask classification instead of per-pixel classification. The key insight is that mask classification can handle both semantic- and instance-level segmentation tasks using the same model, loss, and training procedure. MaskFormer predicts a set of binary masks, each associated with a single class label. This approach simplifies the landscape of effective methods for semantic and panoptic segmentation and achieves excellent empirical results.

MaskFormer outperforms per-pixel classification baselines, especially when the number of classes is large, and achieves state-of-the-art results on ADE20K (55.6 mIoU) and COCO (52.7 PQ). It is evaluated on multiple datasets, including ADE20K, COCO-Stuff-10K, Cityscapes, Mapillary Vistas, and ADE20K-Full, and the results show that it outperforms per-pixel classification models, especially on datasets with a large number of classes. MaskFormer is also effective for instance-level tasks, outperforming DETR on COCO, and for panoptic segmentation, outperforming Max-DeepLab.

The model uses a Transformer decoder to compute a set of pairs, each consisting of a class prediction and a mask embedding vector. The mask embedding vector is used to generate a binary mask prediction via a dot product with per-pixel embeddings. MaskFormer is efficient, with 10% fewer parameters and 40% fewer FLOPs than per-pixel classification models, and is compatible with various backbones, including ResNet and Swin Transformer. It is trained with a combination of cross-entropy and binary mask losses.
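To make the prediction mechanism concrete, here is a minimal sketch (not the authors' code) of a mask-classification head: each query embedding from the Transformer decoder yields a class prediction over K classes plus a "no object" category, and a mask embedding whose dot product with the per-pixel embeddings gives a binary mask. All module and variable names are illustrative assumptions.

```python
# Minimal sketch of a MaskFormer-style mask-classification head.
# Assumes N query embeddings from a Transformer decoder and per-pixel
# embeddings from a pixel decoder; names are illustrative, not the paper's code.
import torch
import torch.nn as nn

class MaskClassificationHead(nn.Module):
    def __init__(self, hidden_dim: int, mask_dim: int, num_classes: int):
        super().__init__()
        # Class prediction: K classes plus one "no object" category.
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)
        # Project each query into the mask-embedding space (an MLP in the paper).
        self.mask_embed = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, mask_dim),
        )

    def forward(self, queries: torch.Tensor, pixel_embeddings: torch.Tensor):
        # queries: (B, N, hidden_dim) from the Transformer decoder.
        # pixel_embeddings: (B, mask_dim, H, W) from the pixel decoder.
        class_logits = self.class_head(queries)          # (B, N, K+1)
        mask_embeddings = self.mask_embed(queries)       # (B, N, mask_dim)
        # Binary mask logits via a dot product between each mask embedding
        # and every per-pixel embedding.
        mask_logits = torch.einsum("bnc,bchw->bnhw",
                                   mask_embeddings, pixel_embeddings)
        return class_logits, mask_logits
```

The number of queries N is fixed (100 in the paper) and independent of the number of classes, which is what lets the same head serve semantic, instance-level, and panoptic segmentation.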
The key advantage of MaskFormer is its ability to unify semantic and instance-level segmentation tasks. The model is simple, efficient, and effective for a wide range of segmentation tasks.
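Because every prediction is a (class, binary mask) pair, the same output can also be collapsed into an ordinary per-pixel semantic map at inference time by weighting each mask with its class probabilities and taking a per-pixel argmax, as in the paper's semantic inference. The sketch below illustrates this step; tensor names and shapes are assumptions for illustration.

```python
# Illustrative sketch of reducing a set of (class, mask) predictions to a
# per-pixel semantic map, following the marginalization used for semantic
# inference in the paper; variable names are assumptions.
import torch

def semantic_inference(class_logits: torch.Tensor,
                       mask_logits: torch.Tensor) -> torch.Tensor:
    # class_logits: (N, K+1), where the last index is the "no object" category.
    # mask_logits:  (N, H, W) binary mask logits for the same N queries.
    class_probs = class_logits.softmax(dim=-1)[:, :-1]  # (N, K), drop "no object"
    mask_probs = mask_logits.sigmoid()                   # (N, H, W)
    # Per-pixel class scores: sum over queries of p(class) * p(pixel in mask).
    semantic_scores = torch.einsum("nk,nhw->khw", class_probs, mask_probs)
    return semantic_scores.argmax(dim=0)                 # (H, W) predicted labels
```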