Per-Pixel Classification is Not All You Need for Semantic Segmentation

31 Oct 2021 | Bowen Cheng, Alexander G. Schwing, Alexander Kirillov
MaskFormer is a novel approach to semantic segmentation that uses mask classification instead of per-pixel classification. The key insight is that mask classification can handle both semantic- and instance-level segmentation tasks using the same model, loss, and training procedure. MaskFormer predicts a set of binary masks, each associated with a single class label. This approach simplifies the landscape of effective methods for semantic and panoptic segmentation and achieves excellent empirical results.

MaskFormer outperforms per-pixel classification baselines, especially when the number of classes is large, and achieves state-of-the-art results on ADE20K (55.6 mIoU) and COCO (52.7 PQ). It is evaluated on multiple datasets, including ADE20K, COCO-Stuff-10K, Cityscapes, Mapillary Vistas, and ADE20K-Full, and the results show that it outperforms per-pixel classification models, especially on datasets with a large number of classes. MaskFormer is also effective for instance-level tasks, outperforming DETR on COCO, and for panoptic segmentation, outperforming Max-DeepLab.

The model uses a Transformer decoder to compute a set of pairs, each consisting of a class prediction and a mask embedding vector. The mask embedding vector is used to generate a binary mask prediction via a dot product with per-pixel embeddings. MaskFormer is efficient, with 10% fewer parameters and 40% fewer FLOPs than per-pixel classification models, and is compatible with various backbones, including ResNet and Swin Transformer. It is trained with a combination of cross-entropy and binary mask losses.
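To make the prediction mechanism concrete, here is a minimal sketch (not the authors' code) of a mask-classification head: each query embedding from the Transformer decoder yields a class prediction over K classes plus a "no object" category, and a mask embedding whose dot product with the per-pixel embeddings gives a binary mask. All module and variable names are illustrative assumptions.

```python
# Minimal sketch of a MaskFormer-style mask-classification head.
# Assumes N query embeddings from a Transformer decoder and per-pixel
# embeddings from a pixel decoder; names are illustrative, not the paper's code.
import torch
import torch.nn as nn

class MaskClassificationHead(nn.Module):
    def __init__(self, hidden_dim: int, mask_dim: int, num_classes: int):
        super().__init__()
        # Class prediction: K classes plus one "no object" category.
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)
        # Project each query into the mask-embedding space (an MLP in the paper).
        self.mask_embed = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, mask_dim),
        )

    def forward(self, queries: torch.Tensor, pixel_embeddings: torch.Tensor):
        # queries: (B, N, hidden_dim) from the Transformer decoder.
        # pixel_embeddings: (B, mask_dim, H, W) from the pixel decoder.
        class_logits = self.class_head(queries)          # (B, N, K+1)
        mask_embeddings = self.mask_embed(queries)       # (B, N, mask_dim)
        # Binary mask logits via a dot product between each mask embedding
        # and every per-pixel embedding.
        mask_logits = torch.einsum("bnc,bchw->bnhw",
                                   mask_embeddings, pixel_embeddings)
        return class_logits, mask_logits
```

The number of queries N is fixed (100 in the paper) and independent of the number of classes, which is what lets the same head serve semantic, instance-level, and panoptic segmentation.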
The key advantage of MaskFormer is its ability to unify semantic and instance-level segmentation tasks. The model is simple, efficient, and effective for a wide range of segmentation tasks.
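Because every prediction is a (class, binary mask) pair, the same output can also be collapsed into an ordinary per-pixel semantic map at inference time by weighting each mask with its class probabilities and taking a per-pixel argmax, as in the paper's semantic inference. The sketch below illustrates this step; tensor names and shapes are assumptions for illustration.

```python
# Illustrative sketch of reducing a set of (class, mask) predictions to a
# per-pixel semantic map, following the marginalization used for semantic
# inference in the paper; variable names are assumptions.
import torch

def semantic_inference(class_logits: torch.Tensor,
                       mask_logits: torch.Tensor) -> torch.Tensor:
    # class_logits: (N, K+1), where the last index is the "no object" category.
    # mask_logits:  (N, H, W) binary mask logits for the same N queries.
    class_probs = class_logits.softmax(dim=-1)[:, :-1]  # (N, K), drop "no object"
    mask_probs = mask_logits.sigmoid()                   # (N, H, W)
    # Per-pixel class scores: sum over queries of p(class) * p(pixel in mask).
    semantic_scores = torch.einsum("nk,nhw->khw", class_probs, mask_probs)
    return semantic_scores.argmax(dim=0)                 # (H, W) predicted labels
```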