This paper proposes a novel approach for semantic segmentation called the Segmentation Transformer, which uses object-contextual representations to improve segmentation performance. The method leverages the relationship between a pixel and its corresponding object region to enhance the representation of each pixel. The approach involves three main steps, sketched in the example below: (1) learning object regions under the supervision of the ground-truth segmentation, (2) computing object region representations by aggregating the pixel representations within each object region, and (3) augmenting each pixel's representation with a weighted aggregation of all object region representations, weighted by the relation between the pixel and each object region.
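The following is a minimal PyTorch-style sketch of these three steps, assuming a backbone feature map `feats` of shape (B, C, H, W) and K object categories; the function and variable names (`ocr_forward`, `region_logits`, etc.) and the omission of the paper's intermediate linear transforms are illustrative assumptions, not the authors' released implementation.

```python
# Sketch of the three object-contextual representation (OCR) steps.
# Names and layer choices are assumptions for illustration only.
import torch
import torch.nn.functional as F

def ocr_forward(feats, region_logits):
    """
    feats:         (B, C, H, W) pixel representations from the backbone.
    region_logits: (B, K, H, W) coarse per-class logits, supervised by the
                   ground-truth segmentation (step 1: soft object regions).
    Returns augmented pixel representations of shape (B, 2C, H, W).
    """
    B, C, H, W = feats.shape

    pixels = feats.flatten(2)                               # (B, C, HW)
    # Step 1: soft object regions, normalized over the spatial positions.
    regions = F.softmax(region_logits.flatten(2), dim=-1)   # (B, K, HW)

    # Step 2: object region representations as weighted sums of pixel features.
    region_repr = torch.bmm(regions, pixels.transpose(1, 2))  # (B, K, C)

    # Step 3: pixel-to-region relation (softmax over the K regions), then
    # aggregate the region representations back to every pixel.
    sim = torch.bmm(pixels.transpose(1, 2),
                    region_repr.transpose(1, 2))             # (B, HW, K)
    relation = F.softmax(sim / C ** 0.5, dim=-1)
    context = torch.bmm(relation, region_repr)               # (B, HW, C)
    context = context.transpose(1, 2).reshape(B, C, H, W)

    # Augment each pixel with its object-contextual representation.
    return torch.cat([feats, context], dim=1)                # (B, 2C, H, W)
```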
The proposed method is implemented within a Transformer encoder-decoder framework, where the decoder cross-attention module handles object region learning and object region representation computation, and the encoder cross-attention module computes the object-contextual representation. The method is evaluated on several benchmark datasets, including Cityscapes, ADE20K, LIP, PASCAL-Context, and COCO-Stuff, achieving competitive performance. It outperforms existing multi-scale and relational context schemes while requiring less computation and memory.
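Under this Transformer reading, the same computation can be sketched as two standard cross-attention calls; the module names and shapes below (`decoder_attn`, `encoder_attn`, a single attention head, 256 channels) are assumptions for illustration and do not refer to the authors' code.

```python
import torch
import torch.nn as nn

C, K = 256, 19  # assumed channel width and number of object categories
decoder_attn = nn.MultiheadAttention(C, num_heads=1, batch_first=True)
encoder_attn = nn.MultiheadAttention(C, num_heads=1, batch_first=True)

def segmentation_transformer(pixel_feats, class_queries):
    """
    pixel_feats:   (B, HW, C) flattened pixel representations.
    class_queries: (B, K, C)  one learned query per object category.
    """
    # Decoder cross-attention: class queries attend to pixels, which jointly
    # learns the soft object regions and their representations.
    region_repr, region_attn = decoder_attn(
        query=class_queries, key=pixel_feats, value=pixel_feats)

    # Encoder cross-attention: pixels attend to the object region
    # representations, producing the object-contextual representation.
    ocr, _ = encoder_attn(
        query=pixel_feats, key=region_repr, value=region_repr)

    # region_attn has shape (B, K, HW) and can be supervised by the
    # ground-truth segmentation, matching step (1) above.
    return torch.cat([pixel_feats, ocr], dim=-1), region_attn
```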
The approach is also extended to the panoptic segmentation task, where it achieves state-of-the-art results on the COCO dataset. The method is shown to be effective in improving segmentation quality, particularly in stuff regions, and achieves high performance with a simple backbone such as ResNet-101. The results demonstrate that the object-contextual representation approach is a promising solution for semantic segmentation, offering a balance between performance, computational efficiency, and memory usage.