Plan-and-Verify Video Reward Reasoning with Spatio-Temporal Scene Graph Grounding

📅 2026-06-10

🤖 AI Summary

Existing reward models for text-to-video generation struggle with fine-grained semantic alignment: they do not systematically check every condition in the prompt, and they do not tie their judgments to interpretable visual evidence. This work proposes SG-PVR, a reward model built on plan-and-verify reasoning grounded in spatio-temporal scene graphs. The approach first decomposes the input prompt into atomic propositions, then verifies each proposition against the generated video and a structured spatio-temporal scene graph extracted from it, which serves as a persistent visual reference. Every prompt condition is therefore checked explicitly and backed by an interpretable visual justification. Experiments show that SG-PVR performs strongly on fine-grained semantic alignment tasks, including temporal ones, and that using it as a test-time reranker further improves the compositional alignment of generated videos.
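
The test-time reranking use mentioned above can be illustrated with a minimal best-of-N sketch. It assumes a generic `reward_fn(prompt, video)` that returns a scalar; `rerank`, `toy_reward`, and the tag-based candidates are hypothetical stand-ins and not SG-PVR's actual interface or scoring.

```python
from typing import Callable, Sequence, TypeVar

V = TypeVar("V")  # whatever object represents one generated video


def rerank(prompt: str,
           candidates: Sequence[V],
           reward_fn: Callable[[str, V], float]) -> list[V]:
    """Return candidates ordered from highest to lowest reward.

    Ties keep their original order because Python's sort is stable.
    """
    scores = [reward_fn(prompt, video) for video in candidates]
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order]


def toy_reward(prompt: str, video: dict) -> float:
    """Hypothetical stand-in reward: fraction of prompt conditions
    that appear among the tags attached to a candidate."""
    conditions = prompt.split(", ")
    return sum(c in video["tags"] for c in conditions) / len(conditions)


# Three hypothetical samples generated for the same prompt.
prompt = "brown dog, red ball, jumps fence"
candidates = [
    {"id": "sample_0", "tags": {"brown dog"}},
    {"id": "sample_1", "tags": {"brown dog", "red ball", "jumps fence"}},
    {"id": "sample_2", "tags": {"brown dog", "red ball"}},
]

ranked = rerank(prompt, candidates, toy_reward)
best = ranked[0]  # the sample kept when doing best-of-N selection
```

In this setup the generator is left unchanged; only the choice among its samples depends on the reward model.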

📝 Abstract

Reward models for text-to-video (T2V) generation guide post-training but often fail at fine-grained semantic alignment. We trace this to two structural weaknesses in existing reasoning-based reward models: they do not systematically verify every condition described in the prompt, and the visual evidence supporting each judgment remains implicit in their free-form reasoning. We propose SG-PVR, a video reward model that addresses these limitations through plan-and-verify reasoning grounded in spatio-temporal scene graphs. The verification plan decomposes the prompt into atomic claims, ensuring every requirement is checked. The spatio-temporal scene graph, encoding entities, attributes, and temporally grounded relations, is extracted from the video and maintained as a persistent structured visual reference throughout reasoning. Each claim is verified against both the video and the scene graph, anchoring judgments in explicit visual evidence. SG-PVR achieves strong performance on semantic alignment, including fine-grained temporal semantics. As a test-time reranker, it further enhances compositional alignment in T2V generation.
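
The plan-and-verify idea can be made concrete with a small symbolic sketch. Everything in it is an illustrative assumption: the `Relation` and `SceneGraph` classes, the four claim kinds, and the fraction-of-verified-claims score are hypothetical, the plan and graph are written by hand rather than produced by a model, and claims are checked against the graph only, whereas the abstract describes verification against both the video and the scene graph.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    subject: str
    predicate: str
    obj: str
    start: float  # seconds into the video where the relation begins
    end: float    # seconds into the video where the relation ends


@dataclass
class SceneGraph:
    entities: dict = field(default_factory=dict)   # entity name -> set of attributes
    relations: list = field(default_factory=list)  # temporally grounded relations

    def find(self, subject, predicate, obj):
        return [r for r in self.relations
                if (r.subject, r.predicate, r.obj) == (subject, predicate, obj)]


def verify(claim, graph):
    """Check one atomic claim; return (verdict, evidence string)."""
    kind, args = claim
    if kind == "entity":
        ok = args[0] in graph.entities
        return ok, f"entity {args[0]!r} {'found' if ok else 'not found'}"
    if kind == "attribute":
        name, attr = args
        ok = attr in graph.entities.get(name, set())
        return ok, f"{name!r} {'has' if ok else 'lacks'} attribute {attr!r}"
    if kind == "relation":
        hits = graph.find(*args)
        if hits:
            return True, f"{args} holds during [{hits[0].start}, {hits[0].end}]"
        return False, f"{args} never observed"
    if kind == "before":
        first, second = args
        a, b = graph.find(*first), graph.find(*second)
        ok = bool(a) and bool(b) and min(r.start for r in a) < min(r.start for r in b)
        return ok, f"{first} {'starts' if ok else 'does not start'} before {second}"
    raise ValueError(f"unknown claim kind: {kind}")


def reward(plan, graph):
    """Assumed aggregation: fraction of claims in the plan that are verified."""
    if not plan:
        raise ValueError("verification plan is empty")
    results = [verify(claim, graph) for claim in plan]
    return sum(ok for ok, _ in results) / len(results), results


# Prompt: "a brown dog picks up a ball, then jumps over a fence"
pick, jump = ("dog", "picks_up", "ball"), ("dog", "jumps_over", "fence")
plan = [("entity", ("dog",)), ("attribute", ("dog", "brown")),
        ("relation", pick), ("relation", jump), ("before", (pick, jump))]

# Hand-written graph for a video that shows both events in the wrong order.
graph = SceneGraph(
    entities={"dog": {"brown"}, "ball": {"red"}, "fence": set()},
    relations=[Relation("dog", "jumps_over", "fence", 0.5, 1.5),
               Relation("dog", "picks_up", "ball", 2.0, 3.0)],
)

score, results = reward(plan, graph)
for ok, evidence in results:
    print("PASS" if ok else "FAIL", evidence)
```

Here the entity, attribute, and relation claims all pass, and only the temporal-order claim fails. A single holistic judgment could easily miss that kind of error, and the evidence string for the failed claim points to the exact condition that was violated.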

Problem

Research questions and friction points this paper is trying to address.

reward model

text-to-video generation

semantic alignment

spatio-temporal reasoning

fine-grained verification

Innovation

Methods, ideas, or system contributions that make the work stand out.

plan-and-verify reasoning

spatio-temporal scene graph

video reward model