24 Jul 2017 | Jacob Andreas, Marcus Rohrbach, Trevor Darrell, Dan Klein
This paper introduces neural module networks (NMNs), a framework for visual question answering (VQA) that combines the representational power of deep networks with the compositional structure of questions. The approach decomposes each question into linguistic substructures and uses these to dynamically construct and train a question-specific network of neural modules. Each question is analyzed with a semantic parser to determine the necessary computational units (e.g., attention, classification) and how they relate. These modules are then composed into a deep network that is jointly trained to produce answers. The resulting model outperforms previous approaches on both the VQA dataset and a new dataset of complex questions about abstract shapes.
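To make the assembly step concrete, the sketch below shows one way a parsed layout could be instantiated from reusable modules. It is a minimal illustration in PyTorch, not the authors' released code: the module names, tensor shapes, and the toy `attend`/`classify` layout are assumptions chosen to mirror the computational units mentioned above.

```python
# Minimal sketch of dynamically assembling a network from a parsed layout.
# Shapes and module set are illustrative, not the paper's exact design.
import torch
import torch.nn as nn


class Attend(nn.Module):
    """Produce an attention map over image locations for a given concept."""
    def __init__(self, feat_dim, num_concepts):
        super().__init__()
        self.weights = nn.Embedding(num_concepts, feat_dim)

    def forward(self, image_feats, concept_id):
        # image_feats: (H*W, feat_dim) flattened convolutional feature map
        w = self.weights(concept_id)              # (feat_dim,)
        scores = image_feats @ w                  # (H*W,)
        return torch.softmax(scores, dim=0)       # attention over locations


class Classify(nn.Module):
    """Map an attended feature vector to answer logits."""
    def __init__(self, feat_dim, num_answers):
        super().__init__()
        self.out = nn.Linear(feat_dim, num_answers)

    def forward(self, image_feats, attention):
        attended = attention @ image_feats        # (feat_dim,)
        return self.out(attended)                 # (num_answers,) logits


def assemble(layout, modules, image_feats):
    """Recursively evaluate a layout such as ("classify", ("attend", concept_id))."""
    op, arg = layout
    if op == "attend":
        return modules["attend"](image_feats, arg)
    if op == "classify":
        attention = assemble(arg, modules, image_feats)
        return modules["classify"](image_feats, attention)
    raise ValueError(f"unknown module: {op}")


# Example: "what color is the dog?" might parse to
#   ("classify", ("attend", dog_concept_id))
# where dog_concept_id is a LongTensor index into the Attend embedding table.
```

Because the same module instances are reused across all layouts, every question contributes gradient signal to the modules it invokes, which is how joint end-to-end training is possible even though each question gets its own network structure.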
The model uses a natural language parser to dynamically lay out a deep network composed of reusable modules. For visual QA tasks, an additional sequence model provides sentence context and learns common-sense knowledge. The model is evaluated on two visual QA tasks: the VQA dataset and a new dataset of complex questions involving spatial relations, set-theoretic reasoning, and shape and attribute recognition. On the new dataset, the model outperforms competing approaches by up to 25% in accuracy.
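As a rough illustration of the additional sequence model mentioned above, the sketch below (again a PyTorch assumption, not the paper's implementation) encodes the full question with an LSTM and combines its answer logits with those produced by the assembled module network; the exact combination rule in the paper may differ.

```python
import torch.nn as nn


class QuestionEncoder(nn.Module):
    """LSTM reading of the full question, supplying context the parse may omit."""
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_answers):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_answers)

    def forward(self, question_ids):
        # question_ids: (1, T) token indices for a single question
        _, (h, _) = self.lstm(self.embed(question_ids))
        return self.out(h[-1].squeeze(0))         # (num_answers,) logits


def combined_logits(module_logits, question_logits):
    # One simple combination: sum the two sets of answer logits before the
    # softmax, so both pathways are trained jointly from the answer loss.
    return module_logits + question_logits
```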
The paper also introduces a new dataset of highly compositional questions about simple arrangements of shapes, demonstrating that the model significantly outperforms previous work. The model's ability to handle complex questions is attributed to its modular and compositional structure, which allows it to dynamically assemble networks for different tasks. The model is trained jointly, with modules learning to specialize in different tasks through end-to-end training. The approach is generalizable and can be applied to other tasks such as visual referring expression resolution and question answering about natural language texts. The paper concludes that NMNs provide a promising framework for learning complex compositional structures in visual QA and other domains.