🤖 AI Summary
Existing Video-LLMs rely on costly, labor-intensive human annotations, ground-truth captions, or third-party APIs to construct preference data, which limits scalability and generalizability. To address this, VideoSAVi is a purely video-driven self-alignment framework: the model critiques its own initial responses against the video content to generate high-quality preference pairs without any external supervision, then trains on those pairs end-to-end with Direct Preference Optimization (DPO) using lightweight frame sampling (32 frames per video). Evaluated on MVBench, PerceptionTest, and EgoSchema, the method reaches 74.0% accuracy on MVBench (state-of-the-art) and absolute gains of 3.9% on PerceptionTest and 6.8% on EgoSchema over baseline models. These results support the effectiveness and generalizability of self-critique-driven preference generation for video-LLM alignment.
📝 Abstract
Recent advances in video large language models (Video-LLMs) have led to significant progress in video understanding. Current preference optimization methods often rely on proprietary APIs or ground-truth captions to generate preference data (i.e., pairs of model outputs ranked by their quality or alignment with human judgment), which is then used to train models for video-language alignment. This approach is both costly and labor-intensive. To address this limitation, we introduce VideoSAVi (Self-Aligned Video Language Model), a self-training pipeline that enables Video-LLMs to reason over video content without external supervision. Our approach includes a self-critiquing mechanism that identifies reasoning errors in the model's initial responses and generates improved alternatives, creating preference pairs directly from video content. VideoSAVi then applies Direct Preference Optimization (DPO), which uses the preference data to iteratively train the model, enhancing temporal and spatial reasoning in video understanding. Experiments show that VideoSAVi achieves state-of-the-art performance on MVBench (74.0%) and delivers significant improvements on other benchmarks, including a 3.9% gain on PerceptionTest and a substantial 6.8% improvement on the challenging EgoSchema dataset compared to baseline models. Our model-agnostic approach is computationally efficient, requiring only 32 frames per video, and offers a promising direction for self-aligned video understanding without reliance on external models or annotations.
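The DPO step described above optimizes the model on self-generated preference pairs. As a rough illustration (not the paper's implementation), the standard DPO objective for one pair can be sketched in pure Python; the sequence-level log-probability interface and the `beta` value here are assumptions for the sake of the example:

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single preference pair.

    Inputs are sequence-level log p(response | video, prompt) under the
    current policy and a frozen reference model; "chosen" is the improved
    self-critiqued answer, "rejected" the flawed initial answer.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log sigmoid(margin): the loss shrinks as the policy assigns
    # relatively more probability to the chosen response.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With identical policy and reference log-probabilities the margin is zero and the loss equals `log 2`; raising the policy's log-probability of the chosen response lowers the loss, which is the behavior DPO training exploits.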