Adaptive Dense Evidence Refinement for Video Relational Reasoning for VRR-QA Challenge

📅 2026-05-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses reasoning over spatial, temporal, viewpoint, depth, and visibility relations in video question answering by proposing an adaptive test-time computation framework. The system first answers each question with a direct video-language model pass, then uses multiple lightweight views to flag questions whose answers are unstable; only those questions are routed to a high-cost dense evidence module. It explicitly separates two subtasks: generating plausible alternative answers and deciding whether the initial answer should actually be revised. The dense module combines timestamped frame observations, relation-specific probes, candidate verification, and conservative temporal aggregation. On the VRR-QA test split, the system reaches an average accuracy of 90.07% and a macro-average accuracy of 87.81%.
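The routing step described above can be illustrated with a minimal sketch. It assumes that instability is measured by disagreement between the direct answer and the answers from the additional views; the source does not state the actual criterion, so the function name `needs_dense_evidence` and the `min_agreement` threshold are hypothetical.

```python
from collections import Counter


def needs_dense_evidence(direct_answer, view_answers, min_agreement=0.75):
    """Return True when a question should be routed to the costly dense module.

    Assumption (not from the source): a question is "unstable" when the
    direct answer is not the most common vote, or when the share of votes
    for the most common answer falls below `min_agreement`.
    """
    # The direct answer goes first so that it wins exact ties in most_common.
    votes = [direct_answer, *view_answers]
    top_answer, top_count = Counter(votes).most_common(1)[0]
    agreement = top_count / len(votes)
    return top_answer != direct_answer or agreement < min_agreement


# Unanimous views: the cheap direct answer is kept and no dense pass is run.
stable = needs_dense_evidence("A", ["A", "A", "A"])
# Split views: the question is flagged for dense evidence refinement.
unstable = needs_dense_evidence("A", ["B", "B", "A"])
```

Under this assumption the expensive module runs only on the flagged subset, which is where the compute savings of an adaptive scheme would come from.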

📝 Abstract

VRR-QA evaluates whether video-language systems can infer spatial, temporal, viewpoint, depth, and visibility relations that are not always resolved by a single frame. We present an inference-only system built around adaptive test-time computation. The system first answers each question with a direct video-language model pass, then uses multiple lightweight views to find unstable questions. Only these difficult questions are routed to a high-budget dense evidence module that constructs timestamped frame observations, relation-specific probes, candidate verification, and conservative temporal aggregation. This design separates two problems that are often confused in video question answering: finding plausible alternative answers and deciding when a current answer should actually be changed. On the test split, the final system obtains 90.07 average accuracy and 87.81 macro average accuracy. The report focuses on the final test system and the implementation settings required to reproduce the adaptive dense verifier.
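The distinction between proposing an alternative answer and actually switching to it can be sketched as a conservative aggregation rule over timestamped observations. This is an illustration under stated assumptions, not the authors' implementation: the `Observation` type, the `conservative_revise` function, and the `min_support` and `margin` thresholds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One timestamped frame observation and the answer it supports (hypothetical)."""
    timestamp_s: float
    supported_answer: str


def conservative_revise(current, observations, min_support=3, margin=2):
    """Keep `current` unless an alternative is clearly better supported.

    Assumed rule: an alternative replaces the current answer only if it is
    backed by at least `min_support` observations and leads the current
    answer by at least `margin` observations.
    """
    support = {}
    for obs in observations:
        support[obs.supported_answer] = support.get(obs.supported_answer, 0) + 1

    current_support = support.get(current, 0)
    best, best_support = current, current_support
    for answer, count in support.items():
        if count > best_support:
            best, best_support = answer, count

    strong_enough = best_support >= min_support
    clear_lead = best_support - current_support >= margin
    if best != current and strong_enough and clear_lead:
        return best
    return current


# Three observations across the clip support "B" and one supports "A",
# so the answer is revised; with weaker evidence it would stay "A".
evidence = [Observation(1.0, "B"), Observation(4.5, "B"),
            Observation(9.0, "B"), Observation(2.0, "A")]
revised = conservative_revise("A", evidence)
```

Candidate generation only has to surface "B" as a possibility; the separate decision rule determines whether the evidence is strong enough to overturn the original answer.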

Problem

Research questions and friction points this paper is trying to address.

video relational reasoning

VRR-QA

dense evidence

temporal aggregation

answer stability

Innovation

Methods, ideas, or system contributions that make the work stand out.

adaptive test-time computation

dense evidence refinement

video relational reasoning