The paper introduces a new task called Unified Perceptual Parsing (UPP): recognizing as many visual concepts as possible from a single image, spanning scene labels, objects, parts, materials, and textures. To cope with the heterogeneous annotations of existing image datasets, the authors propose UPerNet, a multi-task framework that learns to detect all of these concepts jointly from multiple data sources. Because no single dataset annotates every concept, the training strategy randomly samples one data source per iteration and updates only the layers relevant to that source's annotations, avoiding noisy gradients in the unrelated prediction branches.

The model is evaluated on the Broden dataset, which unifies several densely labeled image datasets, and segments a wide range of concepts effectively. The trained network is then applied to discover visual knowledge in natural scenes, uncovering compositional relationships between concepts. The paper also discusses the limitations of existing deep CNNs and argues that UPerNet better exploits hierarchical feature representations while handling heterogeneous annotations.
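The sampling-and-selective-update strategy can be illustrated with a minimal sketch. This is purely schematic, not the paper's implementation: the source names, the task-to-source mapping, and the dummy unit gradients are all hypothetical stand-ins for real datasets, losses, and backpropagation.

```python
import random

# Hypothetical mapping from data sources to the concept heads they annotate.
# (Illustrative only; not taken from the paper's code.)
SOURCE_TASKS = {
    "ADE20K": ["scene", "object", "part"],
    "OpenSurfaces": ["material"],
    "DTD": ["texture"],
}

ALL_TASKS = ["scene", "object", "part", "material", "texture"]

def train_step(weights, rng, lr=0.1):
    """One iteration: sample a data source at random, then update only
    the heads that source annotates; all other heads stay untouched."""
    source = rng.choice(sorted(SOURCE_TASKS))
    grads = {t: 1.0 for t in SOURCE_TASKS[source]}   # dummy unit gradients
    new_weights = {
        t: w - lr * grads.get(t, 0.0)                # unannotated heads: grad 0
        for t, w in weights.items()
    }
    return source, new_weights

# Usage: heads not covered by the sampled source keep their old weights.
weights = {t: 0.0 for t in ALL_TASKS}
source, weights = train_step(weights, random.Random(0))
```

The key design point mirrored here is that the per-iteration loss is built only from the sampled source's annotated tasks, so gradients for the remaining heads are exactly zero rather than noisy pseudo-labels.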