MoReVQA: Exploring Modular Reasoning Models for Video Question Answering


9 Apr 2024 | Juhong Min, Shyamal Buch, Arsha Nagrani, Minsu Cho, Cordelia Schmid
MoReVQA is a multi-stage, modular reasoning model for video question answering (videoQA) that addresses the limitations of single-stage planning methods. It decomposes the videoQA task into three stages: event parsing, grounding, and reasoning, coordinated through an external memory that maintains state across stages and enables a flexible design. Unlike prior single-stage planning methods, MoReVQA combines modularity with multi-stage planning, producing interpretable, grounded planning and execution traces while improving accuracy by breaking down task complexity. The approach is training-free, relying on few-shot prompting of large models and generating interpretable intermediate outputs at each stage (see the pipeline sketch below).

The paper also presents a simple baseline, JCEF (Just Caption Every Frame), which already outperforms prior single-stage modular methods: a large vision-language model captions the video frames, and the captions are fed to an LLM that answers the question (a sketch of this baseline also follows below). JCEF lacks interpretability, however, because its captions are generic rather than question-driven. MoReVQA improves on JCEF by adding grounding and multi-stage reasoning, achieving state-of-the-art results across four standard videoQA benchmarks (NExT-QA, iVQA, EgoSchema, ActivityNet-QA) and extending to related tasks such as grounded videoQA and paragraph captioning. Overall, the paper highlights the value of decomposing planning tasks and of using an external memory to improve both interpretability and accuracy in videoQA.
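To make the three-stage design concrete, here is a minimal Python sketch of a MoReVQA-style pipeline with an external memory. The helpers `call_llm` and `caption_frames`, the prompt strings, and the uniform frame sampling in the grounding stage are all illustrative assumptions, not the paper's actual modules or prompts.

```python
# Minimal sketch of a MoReVQA-style multi-stage pipeline (illustrative only).
# `call_llm` and `caption_frames` are hypothetical stand-ins for the
# few-shot-prompted LLM and vision-language model; they are NOT the
# paper's actual API.

def call_llm(prompt: str) -> str:
    """Placeholder for a few-shot-prompted LLM call."""
    raise NotImplementedError

def caption_frames(video, frame_indices) -> list[str]:
    """Placeholder for a VLM that captions the selected frames."""
    raise NotImplementedError

def morevqa_pipeline(video, question: str) -> str:
    # External memory: shared state that every stage reads and updates.
    memory = {"question": question, "events": None,
              "grounded_frames": None, "evidence": None}

    # Stage 1 - event parsing: decompose the question into the events it asks about.
    memory["events"] = call_llm(
        f"Parse the events this question asks about: {question}")

    # Stage 2 - grounding: locate video content relevant to the parsed events.
    # Uniform sampling here is a stand-in; the real system grounds events
    # temporally in the video.
    frame_indices = list(range(0, len(video), max(1, len(video) // 8)))
    memory["grounded_frames"] = frame_indices
    memory["evidence"] = caption_frames(video, frame_indices)

    # Stage 3 - reasoning: answer the question from the grounded evidence.
    return call_llm(
        f"Question: {question}\n"
        f"Events: {memory['events']}\n"
        f"Evidence: {memory['evidence']}\n"
        "Answer:")
```

Because every stage writes its intermediate output into `memory`, the full planning and execution trace can be inspected after a run, which is the interpretability property the paper emphasizes.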
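The JCEF baseline is simple enough to sketch in a few lines as well. Again, `vlm_caption` and `llm_answer` are hypothetical stand-ins for the large vision-language model and the answering LLM; this is a sketch of the idea under those assumptions, not the paper's implementation.

```python
# Minimal sketch of the JCEF-style baseline described above: caption every
# sampled frame with a VLM, then hand all captions to an LLM to answer.
# `vlm_caption` and `llm_answer` are hypothetical placeholders.

def vlm_caption(frame) -> str:
    """Placeholder for a large vision-language model captioner."""
    raise NotImplementedError

def llm_answer(prompt: str) -> str:
    """Placeholder for a few-shot-prompted LLM."""
    raise NotImplementedError

def jcef(frames, question: str) -> str:
    # Caption each frame independently; captions are generic, not
    # question-driven, which is the baseline's interpretability weakness.
    captions = [f"Frame {i}: {vlm_caption(f)}" for i, f in enumerate(frames)]
    prompt = ("Video frame captions:\n" + "\n".join(captions) +
              f"\nQuestion: {question}\nAnswer:")
    return llm_answer(prompt)
```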