25 Jul 2021 | Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip H.S. Torr, Li Zhang
This paper reconsiders semantic segmentation from a sequence-to-sequence perspective using transformers. Unlike traditional fully convolutional networks (FCNs), which use an encoder-decoder architecture that reduces spatial resolution to learn abstract visual concepts, the proposed Segmentation Transformer (SETR) employs a pure transformer to encode an image as a sequence of patches. This retains global context modeling at every layer of the encoder and avoids the need for resolution reduction. The encoder is combined with a simple decoder to form a powerful segmentation model. Extensive experiments show that SETR achieves state-of-the-art results on ADE20K (50.28% mIoU) and Pascal Context (55.83% mIoU), and competitive results on Cityscapes. Notably, SETR ranked first on the ADE20K test server leaderboard on the day of submission. The paper's contributions are reformulating semantic segmentation as a sequence-to-sequence task, proposing a pure transformer encoder, and demonstrating superior feature representations compared to FCNs with and without attention modules.
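To make the described pipeline concrete, below is a minimal PyTorch sketch of a SETR-style model: the image is split into fixed-size patches, linearly embedded with positional encodings, processed by a pure transformer encoder (no spatial downsampling of the token sequence), and decoded by a simple per-token classification head followed by bilinear upsampling, in the spirit of the paper's "naive" decoder. This is not the authors' implementation; the layer sizes, the patch size, the decoder design, and the class count (assumed to be ADE20K's 150) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SETRNaiveSketch(nn.Module):
    """Illustrative SETR-style model: patch embedding -> pure transformer encoder -> naive decoder."""

    def __init__(self, img_size=256, patch_size=16, in_chans=3,
                 embed_dim=768, depth=12, num_heads=12, num_classes=150):
        super().__init__()
        self.grid = img_size // patch_size  # patches per side

        # Linear projection of flattened patches, implemented as a strided conv.
        self.patch_embed = nn.Conv2d(in_chans, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        # Learnable positional embedding, one vector per patch.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.grid ** 2, embed_dim))

        # Pure transformer encoder: global self-attention at every layer,
        # with no reduction of the token sequence's spatial resolution.
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

        # Simple decoder: 1x1 convs to class logits, then bilinear upsampling.
        self.head = nn.Sequential(
            nn.Conv2d(embed_dim, embed_dim, kernel_size=1),
            nn.BatchNorm2d(embed_dim),
            nn.ReLU(inplace=True),
            nn.Conv2d(embed_dim, num_classes, kernel_size=1),
        )

    def forward(self, x):
        B, _, H, W = x.shape
        tokens = self.patch_embed(x)                    # (B, D, H/P, W/P)
        tokens = tokens.flatten(2).transpose(1, 2)      # (B, N, D) patch sequence
        tokens = self.encoder(tokens + self.pos_embed)  # global context at each layer
        feat = tokens.transpose(1, 2).reshape(B, -1, self.grid, self.grid)
        logits = self.head(feat)                        # (B, num_classes, H/P, W/P)
        return F.interpolate(logits, size=(H, W), mode="bilinear", align_corners=False)


if __name__ == "__main__":
    model = SETRNaiveSketch()
    out = model(torch.randn(1, 3, 256, 256))
    print(out.shape)  # torch.Size([1, 150, 256, 256])
```

The actual SETR encoder is initialized from a pre-trained ViT and the paper also proposes stronger decoders (progressive upsampling and multi-level feature aggregation); the sketch above only mirrors the overall sequence-to-sequence formulation.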