🤖 AI Summary
Existing single-stage planning methods for video question answering (videoQA) suffer from poor robustness, weak visual grounding, and limited interpretability. To address these issues, this paper proposes a training-free, multi-stage modular reasoning framework that decomposes the task into three sequential phases: event-structure parsing, visual content grounding, and final answer inference—each implemented via few-shot prompting of large language models (LLMs) or multimodal LMs, without fine-tuning. Unlike prior single-stage planners, the approach explicitly couples high-level planning with low-level visual evidence, recording interpretable intermediate outputs at every stage. Evaluated on NExT-QA, iVQA, EgoSchema, and ActivityNet-QA, it achieves state-of-the-art performance, and it further generalizes to grounded videoQA and paragraph-level video captioning, demonstrating gains in generalization, robustness, and interpretability.
📝 Abstract
This paper addresses the task of video question answering (videoQA) via a decomposed multi-stage, modular reasoning framework. Previous modular methods have shown promise with a single planning stage ungrounded in visual content. However, through a simple and effective baseline, we find that such systems can lead to brittle behavior in practice for challenging videoQA settings. Thus, unlike traditional single-stage planning methods, we propose a multi-stage system consisting of an event parser, a grounding stage, and a final reasoning stage in conjunction with an external memory. All stages are training-free and performed using few-shot prompting of large models, creating interpretable intermediate outputs at each stage. By decomposing the underlying planning and task complexity, our method, MoReVQA, improves over prior work on standard videoQA benchmarks (NExT-QA, iVQA, EgoSchema, ActivityNet-QA) with state-of-the-art results, and extensions to related tasks (grounded videoQA, paragraph captioning).
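The staged design described above can be sketched as a small pipeline. The code below is an illustrative mock, not the authors' implementation: the function names (`morevqa_pipeline`, `mock_llm`) and the stage prompts are hypothetical, and the LLM call is replaced by a stub, since the real system few-shot prompts large (multi)modal models at each stage and writes results into a shared external memory.

```python
# Illustrative sketch (NOT the authors' code) of a training-free, multi-stage
# videoQA pipeline: event parsing -> grounding -> reasoning, with a shared
# external memory holding interpretable intermediate outputs.

def mock_llm(prompt: str) -> str:
    """Stand-in for a few-shot prompted large (multimodal) model."""
    if "Parse events" in prompt:
        return "events: [person opens door; person enters room]"
    if "Ground" in prompt:
        return "grounding: frames 10-35 depict 'person opens door'"
    return "answer: the person entered the room"

def morevqa_pipeline(video_id: str, question: str, llm=mock_llm) -> dict:
    # External memory shared across all three stages.
    memory = {"video": video_id, "question": question}

    # Stage 1: event parser -- decompose the question into event structure.
    memory["events"] = llm(f"Parse events in question: {question}")

    # Stage 2: grounding -- tie parsed events to visual evidence in the video.
    memory["grounding"] = llm(f"Ground {memory['events']} in video {video_id}")

    # Stage 3: reasoning -- infer the final answer from grounded evidence.
    memory["answer"] = llm(f"Answer '{question}' given {memory['grounding']}")
    return memory

result = morevqa_pipeline("vid_001", "What did the person do after opening the door?")
print(result["answer"])
```

Because every stage writes its output into `memory`, the intermediate parses and groundings remain inspectable, which is the source of the interpretability the paper emphasizes.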