2 Sep 2021 | Robin Strudel*, Ricardo Garcia*, Ivan Laptev, Cordelia Schmid
Segmenter is a transformer-based model for semantic segmentation that captures global context at every layer, in contrast to convolution-based methods whose receptive fields grow only gradually with depth. Built on the Vision Transformer (ViT) architecture, it treats image patches as input tokens, encodes them with a plain transformer, and decodes them with either a simple linear decoder or a mask transformer. The encoder is pre-trained on ImageNet and the full model is fine-tuned end-to-end on semantic segmentation datasets with a per-pixel cross-entropy loss.

The linear decoder already achieves strong results, while the mask transformer further improves performance by generating one mask per class. An extensive ablation study shows that larger backbones and smaller patch sizes both yield better accuracy, and the simple design allows explicit trade-offs between precision and runtime. Segmenter achieves state-of-the-art results on the ADE20K and Pascal Context datasets and is competitive on Cityscapes, outperforming previous convolutional approaches, particularly on challenging datasets, and demonstrating the effectiveness of transformers for semantic segmentation.
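The linear-decoder pipeline described above (patch tokens in, per-pixel class map out) can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' code: all sizes (64×64 image, 16×16 patches, embedding dim 192, 10 classes) are made up for the example, and it uses nearest-neighbor upsampling to keep the sketch short where the paper upsamples bilinearly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not the paper's): a 64x64 image split into
# 16x16 patches gives a 4x4 grid of patch tokens from the encoder.
H = W = 64          # image resolution
P = 16              # patch size
D, K = 192, 10      # token embedding dim, number of classes
gh, gw = H // P, W // P

tokens = rng.standard_normal((gh * gw, D))   # stand-in for ViT encoder output
weight = rng.standard_normal((D, K)) * 0.01  # the linear decoder's only parameters
bias = np.zeros(K)

scores = tokens @ weight + bias              # (N, K): class scores per patch
fmap = scores.reshape(gh, gw, K)             # back onto the 2-D patch grid
# Upsample each patch's scores to pixel resolution (nearest-neighbor here;
# the actual model interpolates bilinearly before the softmax).
seg = np.repeat(np.repeat(fmap, P, axis=0), P, axis=1)   # (H, W, K)
pred = seg.argmax(axis=-1)                   # per-pixel class map, (H, W)
print(pred.shape)
```

The point of the sketch is how little machinery the linear decoder needs: a single `D × K` projection shared across patches, a reshape, and an upsample. The mask transformer replaces that projection with learned class embeddings attended jointly with the patch tokens, which is where the extra accuracy comes from.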