MoReVQA: Exploring Modular Reasoning Models for Video Question Answering


9 Apr 2024 | Juhong Min, Shyamal Buch, Arsha Nagrani, Minsu Cho, Cordelia Schmid
MoReVQA is a multi-stage, modular reasoning model for video question answering (videoQA) that addresses the limitations of single-stage planning methods. It decomposes the videoQA task into three stages: event parsing, grounding, and reasoning, coordinated through an external memory that maintains state across stages and enables a flexible design. Unlike prior single-stage planning methods, MoReVQA combines modularity with multi-stage planning, producing interpretable, grounded planning and execution traces while improving accuracy by breaking down task complexity. The approach is training-free, relying on few-shot prompting of large models and generating interpretable intermediate outputs at each stage (see the pipeline sketch below).

The paper also presents a simple baseline, JCEF (Just Caption Every Frame), which already outperforms prior single-stage modular methods: a large vision-language model captions the video frames, and the captions are fed to an LLM that answers the question (a sketch of this baseline also follows below). JCEF lacks interpretability, however, because its captions are generic rather than question-driven. MoReVQA improves on JCEF by adding grounding and multi-stage reasoning, achieving state-of-the-art results across four standard videoQA benchmarks (NExT-QA, iVQA, EgoSchema, ActivityNet-QA) and extending to related tasks such as grounded videoQA and paragraph captioning. Overall, the paper highlights the value of decomposing planning tasks and of using an external memory to improve both interpretability and accuracy in videoQA.
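To make the three-stage design concrete, here is a minimal Python sketch of a MoReVQA-style pipeline with an external memory. The helpers `call_llm` and `caption_frames`, the prompt strings, and the uniform frame sampling in the grounding stage are all illustrative assumptions, not the paper's actual modules or prompts.

```python
# Minimal sketch of a MoReVQA-style multi-stage pipeline (illustrative only).
# `call_llm` and `caption_frames` are hypothetical stand-ins for the
# few-shot-prompted LLM and vision-language model; they are NOT the
# paper's actual API.

def call_llm(prompt: str) -> str:
    """Placeholder for a few-shot-prompted LLM call."""
    raise NotImplementedError

def caption_frames(video, frame_indices) -> list[str]:
    """Placeholder for a VLM that captions the selected frames."""
    raise NotImplementedError

def morevqa_pipeline(video, question: str) -> str:
    # External memory: shared state that every stage reads and updates.
    memory = {"question": question, "events": None,
              "grounded_frames": None, "evidence": None}

    # Stage 1 - event parsing: decompose the question into the events it asks about.
    memory["events"] = call_llm(
        f"Parse the events this question asks about: {question}")

    # Stage 2 - grounding: locate video content relevant to the parsed events.
    # Uniform sampling here is a stand-in; the real system grounds events
    # temporally in the video.
    frame_indices = list(range(0, len(video), max(1, len(video) // 8)))
    memory["grounded_frames"] = frame_indices
    memory["evidence"] = caption_frames(video, frame_indices)

    # Stage 3 - reasoning: answer the question from the grounded evidence.
    return call_llm(
        f"Question: {question}\n"
        f"Events: {memory['events']}\n"
        f"Evidence: {memory['evidence']}\n"
        "Answer:")
```

Because every stage writes its intermediate output into `memory`, the full planning and execution trace can be inspected after a run, which is the interpretability property the paper emphasizes.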
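The JCEF baseline is simple enough to sketch in a few lines as well. Again, `vlm_caption` and `llm_answer` are hypothetical stand-ins for the large vision-language model and the answering LLM; this is a sketch of the idea under those assumptions, not the paper's implementation.

```python
# Minimal sketch of the JCEF-style baseline described above: caption every
# sampled frame with a VLM, then hand all captions to an LLM to answer.
# `vlm_caption` and `llm_answer` are hypothetical placeholders.

def vlm_caption(frame) -> str:
    """Placeholder for a large vision-language model captioner."""
    raise NotImplementedError

def llm_answer(prompt: str) -> str:
    """Placeholder for a few-shot-prompted LLM."""
    raise NotImplementedError

def jcef(frames, question: str) -> str:
    # Caption each frame independently; captions are generic, not
    # question-driven, which is the baseline's interpretability weakness.
    captions = [f"Frame {i}: {vlm_caption(f)}" for i, f in enumerate(frames)]
    prompt = ("Video frame captions:\n" + "\n".join(captions) +
              f"\nQuestion: {question}\nAnswer:")
    return llm_answer(prompt)
```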