Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations


Ranjay Krishna · Yuke Zhu · Oliver Groth · Justin Johnson · Kenji Hata · Joshua Kravitz · Stephanie Chen · Yannis Kalantidis · Li-Jia Li · David A. Shamma · Michael S. Bernstein · Li Fei-Fei
The Visual Genome dataset is introduced to enable the modeling of relationships between objects in images, aiming to bridge the gap between perceptual and cognitive tasks in computer vision. The dataset contains over 108K images, each with an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects. It includes dense annotations of objects, attributes, and relationships, as well as region descriptions and question-answer pairs. The objects, attributes, and relationships are canonicalized to WordNet synsets, providing a structured representation of the image content. The dataset is designed to support comprehensive scene understanding, including object detection, attribute description, and relationship recognition. The paper discusses the crowdsourcing strategies used to collect the data, the verification process, and the canonicalization of concepts to WordNet synsets. Visual Genome is intended to serve as a benchmark for training and evaluating models that understand and reason about visual scenes.
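
To make the structured representation concrete, below is a minimal Python sketch of what a single region's annotation might look like and how object names can be mapped to WordNet synsets. The dictionary layout (`description`, `objects`, `relationships`, `bbox`) is an illustrative assumption rather than the dataset's published JSON schema; only the NLTK WordNet calls are real library APIs.

```python
# Minimal sketch of a Visual Genome-style region annotation and its
# canonicalization to WordNet synsets. The dict layout is assumed for
# illustration; it is not the dataset's actual schema.
from nltk.corpus import wordnet as wn  # requires: pip install nltk; nltk.download('wordnet')

# Hypothetical annotation for one image region.
region = {
    "description": "a man riding a horse",
    "objects": [
        {"name": "man", "attributes": ["smiling"], "bbox": [50, 30, 120, 200]},
        {"name": "horse", "attributes": ["brown"], "bbox": [40, 100, 300, 250]},
    ],
    "relationships": [
        {"subject": "man", "predicate": "riding", "object": "horse"},
    ],
}

def canonicalize(name):
    """Map an object name to its most frequent WordNet noun synset, if any."""
    synsets = wn.synsets(name, pos=wn.NOUN)
    return synsets[0].name() if synsets else None  # e.g. 'man.n.01'

# Attach a synset to each object, mirroring the dataset's canonicalization step.
for obj in region["objects"]:
    obj["synset"] = canonicalize(obj["name"])

print([(o["name"], o["synset"]) for o in region["objects"]])
# e.g. [('man', 'man.n.01'), ('horse', 'horse.n.01')]
```

In the paper, this mapping is done with crowd verification rather than simply taking the most frequent synset, so the lookup above only gestures at the idea of grounding free-form object names in a shared ontology.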