24 Jul 2017 | Jacob Andreas, Marcus Rohrbach, Trevor Darrell, Dan Klein
This paper introduces Neural Module Networks (NMNs) for visual question answering (VQA), a task that requires understanding both visual scenes and natural language. NMNs dynamically compose a collection of jointly trained neural modules into deep networks based on the linguistic structure of each question. The approach decomposes questions into their linguistic substructures and uses these structures to instantiate modular networks from reusable components for recognizing objects, classifying attributes, and more; the resulting compound networks are trained jointly. The paper evaluates NMNs on two challenging VQA datasets, achieving state-of-the-art results on both the VQA natural-image benchmark and SHAPES, a new synthetic dataset of complex questions about abstract shapes designed to test the model's ability to handle highly compositional questions and to probe its robustness and generalization. The paper concludes by discussing future work, including joint learning of network structures and parameters, and the broader potential of NMNs in other domains such as queries over documents and structured knowledge bases.
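The core idea, composing a question-specific network from a small inventory of reusable modules, can be sketched in a few lines. The following is a toy illustration only: the module names (`find`, `exists`), the tuple-based layout format, and the grid "image" are simplified assumptions for exposition, not the paper's actual parser, module set, or API.

```python
# Toy sketch of the NMN idea (illustrative; names and layout format are
# assumptions, not the paper's exact interface). A question is parsed into
# a layout tree; each node names a reusable module; the modules are then
# composed and evaluated as one network.

def find(image, concept):
    """Attention-style module: mask the grid cells matching a concept."""
    return [[1 if cell == concept else 0 for cell in row] for row in image]

def exists(attention):
    """Measurement-style module: is anything attended to?"""
    return any(any(row) for row in attention)

MODULES = {"find": find, "exists": exists}

def evaluate(layout, image):
    """Recursively evaluate a layout tree like ("exists", ("find", "circle"))."""
    op, arg = layout
    if op == "find":
        return MODULES["find"](image, arg)    # leaf module: attend to a concept
    return MODULES[op](evaluate(arg, image))  # internal module: transform sub-result

# Tiny grid "image" of abstract shapes, in the spirit of SHAPES.
image = [["circle", "square"],
         ["square", "triangle"]]

print(evaluate(("exists", ("find", "circle")), image))   # True
print(evaluate(("exists", ("find", "hexagon")), image))  # False
```

In the actual model the modules are small jointly trained neural networks and the layout comes from a linguistic parse of the question, but the compositional evaluation follows this same recursive pattern.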