2 Sep 2021 | Robin Strudel*, Ricardo Garcia*, Ivan Laptev, Cordelia Schmid
The paper introduces Segmenter, a transformer-based model for semantic image segmentation. Unlike convolutional methods, Segmenter captures global context from the first layer and throughout the network. It leverages pre-trained Vision Transformers (ViT) and uses either a linear decoder or a mask transformer decoder to generate class labels from patch embeddings. The model achieves excellent results on the ADE20K and Pascal Context datasets, outperforming state-of-the-art convolutional approaches by a significant margin. An extensive ablation study shows that large models and small patch sizes improve performance. Segmenter is also competitive on the Cityscapes dataset. The paper contributes a novel transformer-based approach for semantic segmentation, demonstrating its effectiveness and flexibility in handling various tasks.
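The linear-decoder variant described above can be sketched at the level of tensor shapes: the ViT encoder yields one embedding per image patch, a single linear projection maps each embedding to per-class scores, and the resulting low-resolution class map is upsampled to the image size. The sketch below uses NumPy with random stand-ins for the encoder output and decoder weights; all dimension values are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

# Illustrative toy dimensions (assumptions, not the paper's settings)
H = W = 32               # image height and width
P = 8                    # patch size
D = 16                   # patch embedding dimension
K = 4                    # number of semantic classes
N = (H // P) * (W // P)  # number of patches

rng = np.random.default_rng(0)

# Stand-in for the ViT encoder output: one D-dim embedding per patch
patch_embeddings = rng.standard_normal((N, D))

# Linear decoder: a single projection from patch embeddings to class scores
W_dec = rng.standard_normal((D, K))
patch_logits = patch_embeddings @ W_dec  # shape (N, K)

# Reshape to a low-resolution class map, then upsample to full resolution
logits_map = patch_logits.reshape(H // P, W // P, K)
seg_map = logits_map.repeat(P, axis=0).repeat(P, axis=1)  # nearest-neighbour upsample

print(seg_map.shape)  # (32, 32, 4): a class-score vector per pixel
```

This also makes the ablation finding intuitive: a smaller patch size P increases the number of patches N, giving a finer-grained class map before upsampling, at the cost of a longer transformer sequence.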