Neural Motifs: Scene Graph Parsing with Global Context

29 Mar 2018 | Rowan Zellers, Mark Yatskar, Sam Thomson, Yejin Choi
This paper presents a novel approach to scene graph parsing, focusing on the role of motifs: repeated substructures in scene graphs. Analyzing the Visual Genome dataset, the authors find that object labels are highly predictive of relation labels, but not vice versa. They also find that recurring patterns exist even in larger subgraphs, with over 50% of graphs containing motifs involving at least two relations.

Based on these findings, they introduce a new baseline that simply predicts the most frequent relation between each pair of object labels in the training set; this alone improves on the previous state of the art by an average of 3.6%. A minimal sketch of this baseline appears below.

They further propose Stacked Motif Networks (MOTIFNET), an architecture designed to capture higher-order motifs in scene graphs, which achieves an additional 7.1% relative gain over the strong frequency baseline. MOTIFNET uses a sequence of bidirectional LSTMs to encode global context across all detected objects (sketched after the baseline below); this context is then used to predict object labels and, in turn, the relationships between them. The model is trained end to end and achieves significant improvements on Visual Genome.

The paper also examines the importance of context in scene graph parsing, showing that the approach outperforms previous methods across evaluation settings. The authors conclude that modeling global context is crucial for effective scene graph parsing and that their results provide a strong baseline for future research in this area.
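The frequency baseline is straightforward to reproduce. The sketch below, a hypothetical reconstruction rather than the authors' code, counts predicates per ordered pair of object labels in the training triplets and predicts the most frequent one at test time; the triplet format, function names, and fallback predicate are illustrative assumptions.

```python
from collections import Counter, defaultdict

def build_frequency_baseline(training_triplets):
    """training_triplets: iterable of (subject_label, predicate, object_label)."""
    counts = defaultdict(Counter)
    for subj, pred, obj in training_triplets:
        counts[(subj, obj)][pred] += 1
    # Keep only the single most frequent predicate for each object pair.
    return {pair: preds.most_common(1)[0][0] for pair, preds in counts.items()}

def predict_relation(model, subj, obj, fallback="on"):
    # Fall back to a globally common predicate for unseen pairs (assumption).
    return model.get((subj, obj), fallback)

# Example usage with toy triplets.
triplets = [
    ("man", "riding", "horse"),
    ("man", "on", "horse"),
    ("man", "riding", "horse"),
    ("dog", "has", "tail"),
]
model = build_frequency_baseline(triplets)
print(predict_relation(model, "man", "horse"))  # -> "riding"
```

That such a lookup table beats prior learned models underscores the paper's point that object labels alone carry most of the signal about relation labels.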
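To make the global-context idea concrete, here is a minimal sketch of one stage of the MOTIFNET pipeline: a bidirectional LSTM reads a linearized sequence of per-object detector features so that each object's representation is conditioned on every other object in the image. The dimensions, the single-layer setup, and the classifier head are assumptions for illustration; the actual model stacks several such stages (object context followed by edge context) and decodes labels sequentially.

```python
import torch
import torch.nn as nn

class ObjectContextEncoder(nn.Module):
    """Bi-LSTM over detected objects, in the spirit of MOTIFNET's object context."""
    def __init__(self, feat_dim=4096, hidden_dim=512, num_classes=151):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.obj_classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, object_feats):
        # object_feats: (batch, num_objects, feat_dim), e.g. ROI features
        # from a detector, ordered into a sequence.
        context, _ = self.lstm(object_feats)        # (B, N, 2 * hidden_dim)
        obj_logits = self.obj_classifier(context)   # per-object label scores
        return context, obj_logits

# Example: two images, each with 8 detected regions of 4096-d features.
encoder = ObjectContextEncoder()
feats = torch.randn(2, 8, 4096)
context, logits = encoder(feats)
print(context.shape, logits.shape)  # torch.Size([2, 8, 1024]) torch.Size([2, 8, 151])
```

The contextualized representations (`context`) would then feed a second stage that scores predicates for object pairs, which is where the additional 7.1% relative gain over the frequency baseline comes from.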