12 Apr 2017 | Danfei Xu, Yuke Zhu, Christopher B. Choy, Li Fei-Fei
This paper addresses the problem of generating scene graphs from images: visually grounded graph structures that capture objects in an image and the pairwise relationships between them. The authors propose an end-to-end model that uses iterative message passing to jointly refine predictions of object categories, bounding box offsets, and relationship predicates. The model leverages contextual information by passing messages between the two bipartite sub-graphs of the scene graph (nodes for objects, edges for relationships), improving its predictions over multiple iterations. The model's effectiveness is demonstrated on a new scene graph dataset based on Visual Genome, and on the NYU Depth v2 dataset for support relation inference. The experiments show that the proposed model significantly outperforms previous methods at generating scene graphs and reasoning about spatial relations.
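The core update can be sketched roughly as follows in PyTorch: object (node) and relationship (edge) hidden states are each maintained by a GRU, and at every iteration each node pools gated messages from its incident edges while each edge pools messages from its subject and object nodes. This is a minimal illustration, not the authors' implementation; the feature dimension, the sigmoid gating, and the pooling scheme here are assumptions standing in for the learned message pooling described in the paper.

```python
import torch
import torch.nn as nn

class IterativeMessagePassing(nn.Module):
    """Sketch of dual node/edge GRU message passing (illustrative, not the
    paper's code). Dims, gates, and pooling are simplifying assumptions."""

    def __init__(self, dim=512, n_iters=2):
        super().__init__()
        self.n_iters = n_iters
        self.node_gru = nn.GRUCell(dim, dim)  # updates object (node) states
        self.edge_gru = nn.GRUCell(dim, dim)  # updates relationship (edge) states
        # learned scalar gates that weight each incoming message (assumption)
        self.node_msg_gate = nn.Linear(2 * dim, 1)
        self.edge_msg_gate = nn.Linear(2 * dim, 1)

    def forward(self, node_h, edge_h, edges):
        # node_h: (N, dim) object states, e.g. from RoI features
        # edge_h: (E, dim) relationship states, e.g. from union-box features
        # edges:  (E, 2) long tensor of (subject_idx, object_idx) per edge
        subj, obj = edges[:, 0], edges[:, 1]
        for _ in range(self.n_iters):
            # edge -> node: each node pools gated messages from edges
            # in which it participates as subject or as object
            gate_s = torch.sigmoid(
                self.node_msg_gate(torch.cat([node_h[subj], edge_h], dim=1)))
            gate_o = torch.sigmoid(
                self.node_msg_gate(torch.cat([node_h[obj], edge_h], dim=1)))
            node_msg = torch.zeros_like(node_h)
            node_msg.index_add_(0, subj, gate_s * edge_h)
            node_msg.index_add_(0, obj, gate_o * edge_h)
            # node -> edge: each edge pools gated messages from its endpoints
            gate_es = torch.sigmoid(
                self.edge_msg_gate(torch.cat([edge_h, node_h[subj]], dim=1)))
            gate_eo = torch.sigmoid(
                self.edge_msg_gate(torch.cat([edge_h, node_h[obj]], dim=1)))
            edge_msg = gate_es * node_h[subj] + gate_eo * node_h[obj]
            # GRU step: pooled message as input, previous state as hidden
            node_h = self.node_gru(node_msg, node_h)
            edge_h = self.edge_gru(edge_msg, edge_h)
        return node_h, edge_h

# usage on random features: 4 candidate objects, 6 candidate relationships
mp = IterativeMessagePassing(dim=512, n_iters=2)
node_out, edge_out = mp(torch.randn(4, 512), torch.randn(6, 512),
                        torch.randint(0, 4, (6, 2)))
```

After the final iteration, the refined node states would feed classifiers for object categories and bounding box offsets, and the refined edge states a classifier for relationship predicates; those heads are omitted here for brevity.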