Question-Aware Evidence Ledgers for Video Relational Reasoning

📅 2026-06-01

🤖 AI Summary
This work addresses visual relational reasoning in video, where answers often depend on implicit spatial relations, event boundaries, object identities, and conversational context rather than on a single salient frame. It proposes a test-time inference framework built around question-aware evidence ledgers that draw on open-vocabulary detection, depth cues, pairwise crops, automatic speech recognition (ASR), and scene graphs. A conservative gating strategy revises the initial prediction of a GPT-5.5-based video question-answering model only when independent evidence uniquely supports a different option. On the VRR-QA challenge test set, the framework achieves 92.95% overall accuracy and 93.79% macro-average accuracy.
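The two reported figures differ because they average differently. The sketch below assumes that macro accuracy is the unweighted mean of per-category accuracies, which is the usual meaning but is not spelled out in this summary. The function name, the category names, and the toy records are hypothetical and are not data from the paper.

```python
from collections import defaultdict


def overall_and_macro(records):
    """Return (overall, macro) accuracy for (category, is_correct) pairs.

    Assumption: macro accuracy is the unweighted mean of the
    per-category accuracies; overall accuracy pools all questions.
    """
    per_cat = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for category, is_correct in records:
        per_cat[category][0] += int(is_correct)
        per_cat[category][1] += 1

    correct = sum(c for c, _ in per_cat.values())
    total = sum(t for _, t in per_cat.values())
    overall = correct / total
    macro = sum(c / t for c, t in per_cat.values()) / len(per_cat)
    return overall, macro


# Toy records (hypothetical): a large category with one miss and a
# small category with none.
toy = [
    ("counting", True), ("counting", True),
    ("counting", True), ("counting", False),
    ("spatial", True),
]
overall, macro = overall_and_macro(toy)
```

With uneven category sizes, a small category that is answered well pulls the macro figure above the overall one, which is one way a macro score can exceed the overall score.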
📝 Abstract
The VRR-QA challenge evaluates visual relational reasoning in videos, where answers often depend on implicit spatial relations, event boundaries, target identity, and dialogue context rather than a single salient frame. We present a test-time reasoning pipeline built around a strong GPT-5.5 video QA solver and a set of question-aware evidence ledgers. The initial solver answers each question from a uniform video representation, while routed ledgers are prompted to make the required targets, count units, reference frames, and temporal or spatial scope explicit for counting, spatial, endpoint, viewpoint, and dialogue reasoning. External tools such as open-vocabulary detection, depth cues, pair crops, ASR, and scene-graph ledgers are used only as evidence sources. A conservative gate keeps the current answer unless independent evidence uniquely supports a different option. The final evidence-gated pipeline achieves 92.95% overall accuracy and 93.79% macro accuracy on the challenge test split.
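The gating rule in the abstract can be illustrated with a minimal sketch. It is one possible reading, not the authors' implementation: "uniquely supports" is taken to mean that every piece of evidence points to the same option, and "independent" is approximated by requiring a minimum number of distinct sources. The `Evidence` type, the `gate` function, and the `min_sources` parameter are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    source: str  # hypothetical source label, e.g. "detector", "depth", "asr"
    option: str  # answer option this piece of evidence supports


def gate(current: str, evidence: list[Evidence], min_sources: int = 2) -> str:
    """Keep `current` unless the evidence uniquely supports another option.

    Assumptions: any disagreement among evidence items blocks a revision,
    and at least `min_sources` distinct sources must agree.
    """
    supported = {ev.option for ev in evidence}
    sources = {ev.source for ev in evidence}

    # Revise only when exactly one option is supported, it differs from
    # the current answer, and enough distinct sources back it.
    if len(supported) == 1 and len(sources) >= min_sources:
        (candidate,) = supported
        if candidate != current:
            return candidate
    return current


revised = gate("A", [Evidence("depth", "B"), Evidence("asr", "B")])
kept = gate("A", [Evidence("depth", "B"), Evidence("asr", "C")])
```

In this sketch the first call switches to option B because two distinct sources agree, while the second keeps option A because the evidence is split. Defaulting to the solver's answer means weak or conflicting tool output cannot overturn it.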
Problem

Research questions and friction points this paper is trying to address.

video relational reasoning
visual question answering
spatial relations
event boundaries
dialogue context
Innovation

Methods, ideas, or system contributions that make the work stand out.

question-aware evidence ledgers
video relational reasoning
evidence-gated reasoning
test-time inference
multimodal evidence integration