Adaptive Dense Evidence Refinement for Video Relational Reasoning for VRR-QA Challenge

📅 2026-05-31

🤖 AI Summary
This work addresses video question answering over spatial, temporal, viewpoint, depth, and visibility relations by proposing an adaptive test-time computation framework. The approach first answers each question with a direct video-language model pass, then uses multiple lightweight views to flag questions whose answers are unstable; only those questions are routed to a high-cost dense evidence module. It explicitly separates two subtasks: generating alternative answers and deciding whether to revise the initial answer. The dense module combines timestamped frame observations, relation-specific probes, candidate verification, and conservative temporal aggregation. On the VRR-QA test split, the final system attains an average accuracy of 90.07% and a macro-average accuracy of 87.81%.
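The two reported metrics can diverge when question categories differ in size. The sketch below illustrates one common reading, which is an assumption here because the listing does not define the metrics: average accuracy is computed over all questions, and macro-average accuracy is the unweighted mean of per-category accuracies. The function name, category labels, and toy records are hypothetical.

```python
from collections import defaultdict


def average_and_macro_accuracy(records):
    """Compute (average, macro) accuracy from (category, is_correct) pairs.

    Assumption: "macro" means the unweighted mean of per-category accuracy.
    """
    per_category = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for category, is_correct in records:
        per_category[category][0] += int(is_correct)
        per_category[category][1] += 1
    if not per_category:
        raise ValueError("records must not be empty")

    total_correct = sum(correct for correct, _ in per_category.values())
    total = sum(count for _, count in per_category.values())
    average = total_correct / total
    macro = sum(correct / count for correct, count in per_category.values()) / len(per_category)
    return average, macro


# Toy data: a large, easy category and a small, hard one.
toy_records = [("spatial", True)] * 9 + [("spatial", False)] + [("depth", True), ("depth", False)]
average, macro = average_and_macro_accuracy(toy_records)
```

In this toy case the small, harder category pulls the macro average below the overall average, which is the same direction as the gap between the two figures reported above.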
📝 Abstract
VRR-QA evaluates whether video-language systems can infer spatial, temporal, viewpoint, depth, and visibility relations that are not always resolved by a single frame. We present an inference-only system built around adaptive test-time computation. The system first answers each question with a direct video-language model pass, then uses multiple lightweight views to find unstable questions. Only these difficult questions are routed to a high-budget dense evidence module that constructs timestamped frame observations, relation-specific probes, candidate verification, and conservative temporal aggregation. This design separates two problems that are often confused in video question answering: finding plausible alternative answers and deciding when a current answer should actually be changed. On the test split, the final system obtains 90.07 average accuracy and 87.81 macro average accuracy. The report focuses on the final test system and the implementation settings required to reproduce the adaptive dense verifier.
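The routing logic described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the callables `answerer` and `dense_verifier`, the view names, the agreement-based stability test, and the `revision_margin` rule are all assumptions introduced to show how candidate generation can be kept separate from the decision to revise.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical interfaces, not taken from the paper:
#   answerer(video, question, view_name) -> answer label
#   dense_verifier(video, question, candidates) -> {candidate: support score}
Answerer = Callable[[str, str, str], str]
DenseVerifier = Callable[[str, str, Sequence[str]], Dict[str, float]]


def agreement(initial: str, view_answers: Sequence[str]) -> float:
    """Fraction of lightweight-view answers that match the initial answer."""
    if not view_answers:
        return 1.0
    return sum(a == initial for a in view_answers) / len(view_answers)


def answer_adaptively(
    video: str,
    question: str,
    answerer: Answerer,
    dense_verifier: DenseVerifier,
    views: Sequence[str] = ("view_a", "view_b", "view_c"),  # assumed view names
    stability_threshold: float = 1.0,  # assumed: any disagreement means unstable
    revision_margin: float = 0.2,      # assumed: evidence gap needed to revise
) -> Tuple[str, bool]:
    """Return (final_answer, was_routed_to_dense_module)."""
    # Step 1: direct pass gives the initial answer.
    initial = answerer(video, question, "direct")

    # Step 2: cheap extra views are used only to detect instability.
    view_answers = [answerer(video, question, v) for v in views]
    if agreement(initial, view_answers) >= stability_threshold:
        return initial, False  # stable: skip the expensive module

    # Step 3 (candidate generation): alternatives come from the disagreeing views.
    candidates = list(dict.fromkeys([initial, *view_answers]))

    # Step 4 (revision decision): change the answer only on a clear margin.
    scores = dense_verifier(video, question, candidates)
    best = max(candidates, key=lambda c: scores.get(c, 0.0))
    gap = scores.get(best, 0.0) - scores.get(initial, 0.0)
    if best != initial and gap >= revision_margin:
        return best, True
    return initial, True
```

Keeping the revision rule conservative means a plausible alternative alone does not overturn the initial answer; the dense evidence has to favor it by a margin.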
Problem

Research questions and friction points this paper is trying to address.

video relational reasoning
VRR-QA
dense evidence
temporal aggregation
answer stability
Innovation

Methods, ideas, or system contributions that make the work stand out.

adaptive test-time computation
dense evidence refinement
video relational reasoning
uncertainty-aware routing
temporal aggregation