25 Jul 2021 | Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip H.S. Torr, Li Zhang
This paper proposes a new approach to semantic segmentation by treating it as a sequence-to-sequence prediction task using pure transformers. The proposed model, called SEgmentation TRansformer (SETR), replaces the conventional convolutional encoder with a pure transformer-based encoder that processes an image as a sequence of patches. Because the transformer models global context at every layer, the encoder learns discriminative feature representations without the progressive downsampling of a CNN backbone. A simple decoder then recovers the original spatial resolution to produce the final segmentation map. The paper introduces three decoder designs of increasing complexity — naive upsampling (Naive), progressive upsampling (PUP), and multi-level feature aggregation (MLA) — to evaluate the effectiveness of the encoder's feature representations. The model achieves state-of-the-art results on several benchmarks, including ADE20K (50.28% mIoU) and Pascal Context (55.83% mIoU), along with competitive results on Cityscapes, and it took first place on the ADE20K test server leaderboard. The approach is inspired by the success of transformers in natural language processing and image classification, and it demonstrates that a pure transformer can model semantic segmentation effectively without relying on convolutional feature extraction, offering a new perspective on the task.
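To make the pipeline concrete, below is a minimal PyTorch sketch of the SETR idea: patchify, embed, run a pure transformer encoder, then progressively upsample (PUP-style) back to pixel resolution. This is an illustration under assumed hyperparameters, not the authors' implementation — the class name `SETRSketch`, the patch size, embedding width, depth, and decoder channel counts are all chosen for brevity (the paper uses a much larger ViT backbone with pretrained weights and auxiliary losses, omitted here).

```python
# Minimal SETR-style sketch (illustrative only; hyperparameters are assumptions,
# not the paper's ViT-Large configuration).
import torch
import torch.nn as nn

class SETRSketch(nn.Module):
    def __init__(self, img_size=256, patch_size=16, in_chans=3,
                 embed_dim=256, depth=4, num_heads=8, num_classes=19):
        super().__init__()
        self.grid = img_size // patch_size  # tokens per spatial side
        # Patch embedding: a strided conv is equivalent to flattening each
        # patch and applying a shared linear projection.
        self.patch_embed = nn.Conv2d(in_chans, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.grid ** 2, embed_dim))
        # Pure transformer encoder: global self-attention at every layer.
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # PUP-style decoder: alternate convs with 2x bilinear upsampling until
        # the patch grid (H/16 x W/16) is restored to full resolution.
        def up_block(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Upsample(scale_factor=2, mode="bilinear",
                            align_corners=False))
        self.decoder = nn.Sequential(
            up_block(embed_dim, 128), up_block(128, 64),
            up_block(64, 64), up_block(64, 64),
            nn.Conv2d(64, num_classes, 1))

    def forward(self, x):
        b = x.size(0)
        # (B, C, H, W) -> (B, N, C): one token per patch.
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)
        tokens = self.encoder(tokens + self.pos_embed)
        # Tokens back to a 2D feature map, then decode to per-pixel logits.
        feat = tokens.transpose(1, 2).reshape(b, -1, self.grid, self.grid)
        return self.decoder(feat)  # (B, num_classes, H, W)

logits = SETRSketch()(torch.randn(1, 3, 256, 256))
print(logits.shape)  # torch.Size([1, 19, 256, 256])
```

The Naive and MLA variants from the paper differ only in the decoder: Naive projects tokens to class channels and bilinearly upsamples in one step, while MLA aggregates token features drawn from several encoder depths before upsampling.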