VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Video generation models often produce logically inconsistent outputs when they must follow task-specific rules, and vision-language models (VLMs) used as solvers can only offer textual guidance that misses fine-grained spatiotemporal detail. This paper instead repurposes a VLM as a test-time "teacher": the VLM's perceptual capabilities are used to construct a differentiable reward function, and that reward drives online optimization of a lightweight LoRA module so the video generation model can adapt to each reasoning task. On VBVR-Bench and RULER-Bench, the method yields a 16.7-point average gain, compared with +0.4 points for VLM-as-Solver and +2.2 points for Best-of-N sampling, at comparable test-time cost. The authors present this as extending reasoning beyond the intrinsic limits of current video generation models.

📝 Abstract

The recent "Reasoning with Video" paradigm utilizes Video Generation Models (VGMs) to generate temporally coherent visual trajectories to complete reasoning tasks. Although state-of-the-art VGMs excel at visual quality, they often struggle to understand and follow task-specific rules, leading to logical failures across diverse reasoning scenarios. Existing efforts try to utilize Vision-Language Models (VLMs) as problem pre-solvers to produce or refine textual guidance for the VGM. However, textual descriptions fail to capture intricate spatiotemporal details, and VGMs often struggle to faithfully execute fine-grained or long-tail instructions even with a valid plan. While VLMs struggle as solvers, they possess strong perception capabilities to evaluate process-constraint satisfaction and final-goal achievement. Leveraging this strength, we introduce a paradigm shift that transitions the role of VLMs to "teachers". Specifically, a VLM teacher extracts task-specific rules to formulate differentiable rewards, guiding a VGM Reasoner via test-time online optimization of a lightweight LoRA module. This strategy enables adaptive test-time optimization and extends the reasoning capabilities beyond the VGM's intrinsic boundaries. Evaluations on symbolic (VBVR-Bench) and general-purpose (RULER-Bench) video reasoning benchmarks show that the proposed method yields a 16.7-point average performance gain, outperforming the VLM-as-Solver paradigm (+0.4 points) and Best-of-N scaling (+2.2 points) by a large margin at comparable test-time cost. These findings reveal that integrating VLMs as test-time teachers offers a promising paradigm for achieving generalizable video reasoning. Project Page: https://VLM-as-Teacher.github.io/
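To make the core loop concrete, the following is a minimal toy sketch of test-time optimization of a low-rank adapter against a differentiable reward. It is not the paper's implementation: the frozen matrix `W` stands in for the video generation model, the negative squared distance in `reward` stands in for a VLM-derived reward, and every name and hyperparameter here (`generate`, `adapt`, `rank`, `lr`, `steps`) is a hypothetical chosen for illustration.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def generate(W, A, B, x):
    """Frozen base output plus low-rank correction: y = W x + B (A x)."""
    h = matvec(A, x)
    y = [b + d for b, d in zip(matvec(W, x), matvec(B, h))]
    return y, h

def reward(y, target):
    """Toy differentiable reward (assumption): negative squared distance
    to a target that encodes the task constraint."""
    return -sum((yi - ti) ** 2 for yi, ti in zip(y, target))

def adapt(W, x, target, rank=2, steps=200, lr=0.05):
    """Gradient ascent on the adapter (A, B) only; W is never modified."""
    d_out, d_in = len(W), len(x)
    # LoRA-style init: B = 0 so the adapter starts as a no-op. A uses small
    # fixed values (instead of the usual random init) for reproducibility.
    A = [[0.5 if j == k else 0.0 for j in range(d_in)] for k in range(rank)]
    B = [[0.0] * rank for _ in range(d_out)]
    for _ in range(steps):
        y, h = generate(W, A, B, x)
        g = [-2.0 * (yi - ti) for yi, ti in zip(y, target)]  # dR/dy
        # dR/dh = B^T g, computed before B is updated in this step
        bt_g = [sum(B[i][k] * g[i] for i in range(d_out)) for k in range(rank)]
        for i in range(d_out):
            for k in range(rank):
                B[i][k] += lr * g[i] * h[k]      # dR/dB = g h^T
        for k in range(rank):
            for j in range(d_in):
                A[k][j] += lr * bt_g[k] * x[j]   # dR/dA = (B^T g) x^T
    return A, B

# Hypothetical toy problem: the frozen model's output violates the "rule".
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
x = [1.0, 0.5, -1.0]
target = [0.0, 1.0]

before = reward(matvec(W, x), target)
A, B = adapt(W, x, target)
after = reward(generate(W, A, B, x)[0], target)
print("reward before:", before, "reward after:", after)
```

The design point the sketch tries to mirror is that only the small adapter is updated per task while the base weights stay frozen, which is what keeps the adaptation lightweight. In a real system the reward gradient would have to flow through the video generator, which this toy does not attempt to model.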

Problem

Research questions and friction points this paper is trying to address.

Video Reasoning

Vision-Language Models

Video Generation Models

Task-specific Rules

Spatiotemporal Details

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Models

Video Reasoning

Test-Time Optimization