🤖 AI Summary
Existing single-stage planning methods for video question answering (videoQA) suffer from poor robustness, weak visual grounding, and limited interpretability. To address these issues, this paper proposes a training-free, multi-stage modular reasoning framework that decomposes the task into three sequential phases: event-structure parsing, visual content grounding, and final answer inference—each implemented via few-shot prompting of large language models (LLMs) or multimodal LMs, without fine-tuning. Unlike prior single-stage planners, the approach explicitly couples high-level planning with low-level visual evidence, recording interpretable intermediate outputs at every stage. Evaluated on NExT-QA, iVQA, EgoSchema, and ActivityNet-QA, it achieves state-of-the-art performance, and it further generalizes to grounded videoQA and paragraph-level video captioning, demonstrating gains in generalization, robustness, and interpretability.
📝 Abstract
This paper addresses the task of video question answering (videoQA) via a decomposed multi-stage, modular reasoning framework. Previous modular methods have shown promise with a single planning stage ungrounded in visual content. However, through a simple and effective baseline, we find that such systems can lead to brittle behavior in practice for challenging videoQA settings. Thus, unlike traditional single-stage planning methods, we propose a multi-stage system consisting of an event parser, a grounding stage, and a final reasoning stage in conjunction with an external memory. All stages are training-free and performed using few-shot prompting of large models, creating interpretable intermediate outputs at each stage. By decomposing the underlying planning and task complexity, our method, MoReVQA, improves over prior work on standard videoQA benchmarks (NExT-QA, iVQA, EgoSchema, ActivityNet-QA) with state-of-the-art results, and extensions to related tasks (grounded videoQA, paragraph captioning).
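The staged design described above can be sketched as a small pipeline. The code below is an illustrative mock, not the authors' implementation: the function names (`morevqa_pipeline`, `mock_llm`) and the stage prompts are hypothetical, and the LLM call is replaced by a stub, since the real system few-shot prompts large (multi)modal models at each stage and writes results into a shared external memory.

```python
# Illustrative sketch (NOT the authors' code) of a training-free, multi-stage
# videoQA pipeline: event parsing -> grounding -> reasoning, with a shared
# external memory holding interpretable intermediate outputs.

def mock_llm(prompt: str) -> str:
    """Stand-in for a few-shot prompted large (multimodal) model."""
    if "Parse events" in prompt:
        return "events: [person opens door; person enters room]"
    if "Ground" in prompt:
        return "grounding: frames 10-35 depict 'person opens door'"
    return "answer: the person entered the room"

def morevqa_pipeline(video_id: str, question: str, llm=mock_llm) -> dict:
    # External memory shared across all three stages.
    memory = {"video": video_id, "question": question}

    # Stage 1: event parser -- decompose the question into event structure.
    memory["events"] = llm(f"Parse events in question: {question}")

    # Stage 2: grounding -- tie parsed events to visual evidence in the video.
    memory["grounding"] = llm(f"Ground {memory['events']} in video {video_id}")

    # Stage 3: reasoning -- infer the final answer from grounded evidence.
    memory["answer"] = llm(f"Answer '{question}' given {memory['grounding']}")
    return memory

result = morevqa_pipeline("vid_001", "What did the person do after opening the door?")
print(result["answer"])
```

Because every stage writes its output into `memory`, the intermediate parses and groundings remain inspectable, which is the source of the interpretability the paper emphasizes.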