Unified Perceptual Parsing for Scene Understanding

26 Jul 2018 | Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, Jian Sun
This paper introduces a new task called Unified Perceptual Parsing (UPP), which requires a machine vision system to recognize as many visual concepts as possible from a given image, parsing the scene, objects, parts, materials, and textures simultaneously. To tackle UPP, the authors propose a multi-task framework called UPerNet together with a training strategy that learns from heterogeneous image annotations, since no single existing dataset labels all of these concept levels. They first construct a dataset called Broden+ by combining several sources of image annotations, covering scenes, objects, parts, materials, and textures. They then define metrics to evaluate the framework: pixel accuracy (P.A.), mean IoU (mIoU), and mean IoU including background (mIoU-bg). Trained on Broden+, the framework effectively segments a wide range of concepts from images.
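As a concrete reference, below is a minimal sketch (not taken from the paper) of how pixel accuracy and mean IoU are typically computed from predicted and ground-truth label maps. The `ignore_index` convention and the assumption that predictions fall in `[0, num_classes)` are illustrative choices; mIoU-bg can be approximated by additionally counting the unlabeled background as a class before computing the same statistics.

```python
import numpy as np

def segmentation_metrics(pred, target, num_classes, ignore_index=255):
    """Pixel accuracy and mean IoU for integer label maps of the same shape.

    Pixels labeled `ignore_index` in `target` are excluded from both metrics.
    Assumes predicted labels lie in [0, num_classes).
    """
    valid = target != ignore_index
    pred, target = pred[valid], target[valid]

    # Pixel accuracy (P.A.): fraction of correctly labeled pixels.
    pixel_acc = (pred == target).mean() if target.size else 0.0

    # Per-class IoU via a confusion matrix, averaged over classes
    # that actually appear in the ground truth.
    conf = np.bincount(
        num_classes * target.astype(np.int64) + pred.astype(np.int64),
        minlength=num_classes ** 2,
    ).reshape(num_classes, num_classes)
    inter = np.diag(conf)
    union = conf.sum(0) + conf.sum(1) - inter
    present = conf.sum(1) > 0
    iou = inter[present] / np.maximum(union[present], 1)
    mean_iou = iou.mean() if present.any() else 0.0
    return pixel_acc, mean_iou
```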
UPerNet is built on a Feature Pyramid Network (FPN) and incorporates a Pyramid Pooling Module (PPM) to enlarge the effective receptive field and improve performance. Separate branches on top of the shared features handle scene classification, object and part segmentation, material segmentation, and texture classification, so the network parses all of these visual concepts in a single pass. Because texture annotations are available only at the image level, the authors also propose a training method that lets the network predict pixel-wise texture labels from image-level annotations alone. Experiments on Broden+ show that the framework achieves competitive performance on semantic segmentation while effectively segmenting a broad range of concepts, and the trained networks are further applied to discover rich visual knowledge in natural scenes, which could help future vision systems better understand their surroundings.
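The heterogeneous training strategy lends itself to a short sketch. Everything below is a hypothetical stand-in rather than the authors' code: `MultiHeadParser`, the backbone stub with an `out_channels` attribute, and the `loaders` dictionary of per-source data iterators are assumptions for illustration. The real UPerNet attaches its heads to different levels of the FPN (with the PPM applied to the deepest features); this sketch only shows the core idea of sampling one data source per iteration and back-propagating through the branches that source annotates.

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadParser(nn.Module):
    """Shared backbone with one head per concept level (illustrative sketch)."""

    def __init__(self, backbone, num_obj, num_part, num_mat, num_scene, num_tex):
        super().__init__()
        self.backbone = backbone           # assumed to return one fused feature map [N, C, h, w]
        c = backbone.out_channels          # hypothetical attribute of the backbone stub
        self.heads = nn.ModuleDict({
            "object":   nn.Conv2d(c, num_obj, 1),   # per-pixel object labels
            "part":     nn.Conv2d(c, num_part, 1),  # per-pixel part labels
            "material": nn.Conv2d(c, num_mat, 1),   # per-pixel material labels
            "texture":  nn.Conv2d(c, num_tex, 1),   # supervised from image-level labels
        })
        self.scene_head = nn.Linear(c, num_scene)   # image-level scene label

    def forward(self, images):
        feats = self.backbone(images)
        out = {task: head(feats) for task, head in self.heads.items()}
        out["scene"] = self.scene_head(feats.mean(dim=(2, 3)))  # global average pool
        return out


def train_step(model, optimizer, loaders):
    """One iteration: sample a data source and update only the branches it annotates."""
    source = random.choice(list(loaders))     # e.g. an object/part source vs. a texture source
    images, targets = next(loaders[source])   # `targets` holds only the labels this source has
    outputs = model(images)
    loss = torch.zeros((), device=images.device)
    for task, target in targets.items():
        out = outputs[task]
        if out.dim() == 4 and target.dim() == 3:
            # dense labels: upsample the prediction to the annotation resolution
            out = F.interpolate(out, size=target.shape[-2:],
                                mode="bilinear", align_corners=False)
        elif out.dim() == 4 and target.dim() == 1:
            # image-level labels (texture): broadcast the label over every pixel,
            # one simple reading of the training trick described above
            target = target.view(-1, 1, 1).expand(-1, *out.shape[-2:]).contiguous()
        loss = loss + F.cross_entropy(out, target, ignore_index=255)
    optimizer.zero_grad()
    loss.backward()   # gradients reach only the heads whose losses were computed
    optimizer.step()
    return source, loss.item()
```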