This paper addresses video question answering (videoQA) with a decomposed, multi-stage modular reasoning framework. Previous modular methods, which rely on a single planning stage ungrounded in visual content, have shown promise but can be brittle in practice. To address this, the authors propose MoReVQA, a multi-stage system consisting of an event parser, a grounding stage, and a final reasoning stage, all performed via few-shot prompting of large models (a minimal sketch of this pipeline follows the contribution list). This approach achieves state-of-the-art results on standard videoQA benchmarks (NExT-QA, iVQA, EgoSchema, ActivityNet-QA) and extends to related tasks (grounded videoQA, paragraph captioning). The key contributions include:
1. **Finding that existing single-stage code-generation frameworks, while modular and interpretable, are not well-suited for the complexity of generalizable videoQA**.
2. **Designing a multi-stage modular reasoning system (MoReVQA) that effectively decomposes underlying planning sub-tasks**.
3. **Achieving state-of-the-art results on four standard videoQA benchmarks (NExT-QA, iVQA, EgoSchema, ActivityNet-QA) among training-free methods**.
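To make the three-stage decomposition concrete, below is a minimal sketch of such a pipeline. It is not the paper's implementation: the names (`llm`, `Memory`, the stage functions) are hypothetical stand-ins, and the actual MoReVQA prompts, API modules, and memory format differ. The sketch only illustrates the pattern of few-shot-prompted stages communicating through a shared external memory.

```python
# A minimal sketch of a three-stage modular videoQA pipeline in the spirit of
# MoReVQA. Everything here is a hypothetical stand-in; the paper's actual
# prompts and API modules are not reproduced.

from dataclasses import dataclass, field


@dataclass
class Memory:
    """Shared state passed between stages."""
    events: list[str] = field(default_factory=list)            # parsed events
    grounded_frames: list[int] = field(default_factory=list)   # relevant frames


def llm(prompt: str) -> str:
    """Placeholder for a few-shot-prompted large model call."""
    raise NotImplementedError("plug in an LLM/VLM client here")


def event_parser(question: str, mem: Memory) -> None:
    # Stage 1: decompose the question into the events it refers to.
    out = llm(f"<few-shot examples>\nQuestion: {question}\nEvents:")
    mem.events = [e.strip() for e in out.split(";") if e.strip()]


def grounding(video_frames: list, mem: Memory) -> None:
    # Stage 2: ground each parsed event in the video. A real system would
    # query a vision-language module over the frames; here we only show the
    # prompting structure.
    for event in mem.events:
        out = llm(
            f"<few-shot examples>\nVideo with {len(video_frames)} frames.\n"
            f"Event: {event}\nRelevant frame indices:"
        )
        mem.grounded_frames += [int(i) for i in out.split(",") if i.strip().isdigit()]


def reasoning(question: str, mem: Memory) -> str:
    # Stage 3: answer the question using only the grounded evidence in memory.
    context = f"Events: {mem.events}\nFrames: {sorted(set(mem.grounded_frames))}"
    return llm(f"<few-shot examples>\n{context}\nQuestion: {question}\nAnswer:")


def morevqa_style_pipeline(video_frames: list, question: str) -> str:
    mem = Memory()
    event_parser(question, mem)
    grounding(video_frames, mem)
    return reasoning(question, mem)
```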
The paper also includes a detailed analysis of the limitations of single-stage approaches and the benefits of the proposed multi-stage decomposition. The experimental results demonstrate that MoReVQA outperforms both single-stage planning models and other key baselines while providing interpretable, grounded planning and execution traces.