28 Oct 2021 | Enze Xie1, Wenhai Wang2, Zhiding Yu3, Anima Anandkumar3,4, Jose M. Alvarez3, Ping Luo1
SegFormer is a novel semantic segmentation framework that unifies Transformers with lightweight multi-layer perceptron (MLP) decoders. It features a hierarchical Transformer encoder that outputs multiscale features without requiring positional encoding, and an all-MLP decoder that aggregates information from different layers, combining local and global attention to produce powerful representations. This design achieves both efficiency and accuracy, as demonstrated on benchmarks such as ADE20K and Cityscapes. SegFormer-B4, for instance, reaches 50.3% mIoU on ADE20K with 64M parameters: 5× smaller and 2.2% more accurate than the previous best method. The largest model, SegFormer-B5, achieves 84.0% mIoU on the Cityscapes validation set and shows excellent zero-shot robustness. The paper also includes extensive experiments and ablation studies to validate the effectiveness of SegFormer's design choices.
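The all-MLP decoder idea described above can be sketched as follows. This is an illustrative, simplified numpy version under my own assumptions (random weights, nearest-neighbor upsampling, made-up channel widths), not the official implementation: each pyramid stage's features are projected to a common channel width by a per-stage linear layer, upsampled to the highest (stride-4) resolution, concatenated, and fused by one more linear layer that predicts per-pixel class logits.

```python
import numpy as np

def mlp_decode(features, embed_dim=64, num_classes=19, seed=0):
    """Illustrative sketch of an all-MLP segmentation decoder:
    project -> upsample -> concatenate -> fuse. Weights are random
    placeholders; a real model would learn them."""
    rng = np.random.default_rng(seed)
    target_h, target_w = features[0].shape[1:]  # stride-4 resolution
    unified = []
    for f in features:
        c, h, w = f.shape
        # per-stage linear projection (acts like a 1x1 conv / MLP layer)
        w_proj = rng.standard_normal((embed_dim, c)) * 0.01
        proj = np.einsum('oc,chw->ohw', w_proj, f)
        # nearest-neighbor upsampling to the largest feature resolution
        proj = proj.repeat(target_h // h, axis=1).repeat(target_w // w, axis=2)
        unified.append(proj)
    fused = np.concatenate(unified, axis=0)  # (num_stages * embed_dim, H, W)
    # final linear fusion producing per-pixel class logits
    w_fuse = rng.standard_normal((num_classes, fused.shape[0])) * 0.01
    return np.einsum('oc,chw->ohw', w_fuse, fused)

# four pyramid stages at strides 4, 8, 16, 32 of a 64x64 input
# (channel widths here are assumptions for the sketch)
feats = [np.ones((c, 64 // s, 64 // s), dtype=np.float32)
         for c, s in zip((32, 64, 160, 256), (4, 8, 16, 32))]
logits = mlp_decode(feats)
print(logits.shape)  # (19, 16, 16): per-pixel logits at stride 4
```

Because every stage passes through only linear layers and simple upsampling, the decoder stays lightweight; the heavy lifting (and the effective receptive field) comes from the hierarchical Transformer encoder.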