This paper proposes a novel approach for semantic segmentation called the Segmentation Transformer, which uses object-contextual representations to improve segmentation performance. The method leverages the relationship between a pixel and its corresponding object region to enhance the representation of each pixel. The approach involves three main steps, sketched in the example below: (1) learning object regions under the supervision of the ground-truth segmentation, (2) computing object region representations by aggregating the pixel representations within each object region, and (3) augmenting each pixel's representation with a weighted aggregation of all object region representations, weighted by the relation between the pixel and each object region.
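The following is a minimal PyTorch-style sketch of these three steps, assuming a backbone feature map `feats` of shape (B, C, H, W) and K object categories; the function and variable names (`ocr_forward`, `region_logits`, etc.) and the omission of the paper's intermediate linear transforms are illustrative assumptions, not the authors' released implementation.

```python
# Sketch of the three object-contextual representation (OCR) steps.
# Names and layer choices are assumptions for illustration only.
import torch
import torch.nn.functional as F

def ocr_forward(feats, region_logits):
    """
    feats:         (B, C, H, W) pixel representations from the backbone.
    region_logits: (B, K, H, W) coarse per-class logits, supervised by the
                   ground-truth segmentation (step 1: soft object regions).
    Returns augmented pixel representations of shape (B, 2C, H, W).
    """
    B, C, H, W = feats.shape

    pixels = feats.flatten(2)                               # (B, C, HW)
    # Step 1: soft object regions, normalized over the spatial positions.
    regions = F.softmax(region_logits.flatten(2), dim=-1)   # (B, K, HW)

    # Step 2: object region representations as weighted sums of pixel features.
    region_repr = torch.bmm(regions, pixels.transpose(1, 2))  # (B, K, C)

    # Step 3: pixel-to-region relation (softmax over the K regions), then
    # aggregate the region representations back to every pixel.
    sim = torch.bmm(pixels.transpose(1, 2),
                    region_repr.transpose(1, 2))             # (B, HW, K)
    relation = F.softmax(sim / C ** 0.5, dim=-1)
    context = torch.bmm(relation, region_repr)               # (B, HW, C)
    context = context.transpose(1, 2).reshape(B, C, H, W)

    # Augment each pixel with its object-contextual representation.
    return torch.cat([feats, context], dim=1)                # (B, 2C, H, W)
```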
The proposed method is implemented within a Transformer encoder-decoder framework, where the decoder cross-attention module handles object region learning and object region representation computation, and the encoder cross-attention module computes the object-contextual representation. The method is evaluated on several benchmark datasets, including Cityscapes, ADE20K, LIP, PASCAL-Context, and COCO-Stuff, achieving competitive performance. It outperforms existing multi-scale and relational context schemes while requiring less computation and memory.
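Under this Transformer reading, the same computation can be sketched as two standard cross-attention calls; the module names and shapes below (`decoder_attn`, `encoder_attn`, a single attention head, 256 channels) are assumptions for illustration and do not refer to the authors' code.

```python
import torch
import torch.nn as nn

C, K = 256, 19  # assumed channel width and number of object categories
decoder_attn = nn.MultiheadAttention(C, num_heads=1, batch_first=True)
encoder_attn = nn.MultiheadAttention(C, num_heads=1, batch_first=True)

def segmentation_transformer(pixel_feats, class_queries):
    """
    pixel_feats:   (B, HW, C) flattened pixel representations.
    class_queries: (B, K, C)  one learned query per object category.
    """
    # Decoder cross-attention: class queries attend to pixels, which jointly
    # learns the soft object regions and their representations.
    region_repr, region_attn = decoder_attn(
        query=class_queries, key=pixel_feats, value=pixel_feats)

    # Encoder cross-attention: pixels attend to the object region
    # representations, producing the object-contextual representation.
    ocr, _ = encoder_attn(
        query=pixel_feats, key=region_repr, value=region_repr)

    # region_attn has shape (B, K, HW) and can be supervised by the
    # ground-truth segmentation, matching step (1) above.
    return torch.cat([pixel_feats, ocr], dim=-1), region_attn
```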
The approach is also extended to the panoptic segmentation task, where it achieves state-of-the-art results on the COCO dataset. The method is shown to be effective in improving segmentation quality, particularly in stuff regions, and achieves high performance with a simple backbone such as ResNet-101. The results demonstrate that the object-contextual representation approach is a promising solution for semantic segmentation, offering a balance between performance, computational efficiency, and memory usage.