🤖 AI Summary
Existing Video-LLMs rely on costly, labor-intensive human annotations, ground-truth captions, or third-party APIs to construct preference data, which limits scalability and generalizability. To address this, VideoSAVi is a purely video-driven self-alignment framework: the model critiques its own initial responses against the video content to generate high-quality preference pairs without any external supervision, then trains on those pairs end-to-end with Direct Preference Optimization (DPO) using lightweight frame sampling (32 frames per video). Evaluated on MVBench, PerceptionTest, and EgoSchema, the method reaches 74.0% accuracy on MVBench (state-of-the-art) and absolute gains of 3.9% on PerceptionTest and 6.8% on EgoSchema over baseline models. These results support the effectiveness and generalizability of self-critique-driven preference generation for video-LLM alignment.
📝 Abstract
Recent advances in video large language models (Video-LLMs) have led to significant progress in video understanding. Current preference optimization methods often rely on proprietary APIs or ground-truth captions to generate preference data (i.e., pairs of model outputs ranked by their quality or alignment with human judgment), which is then used to train models for video-language alignment. This approach is both costly and labor-intensive. To address this limitation, we introduce VideoSAVi (Self-Aligned Video Language Model), a self-training pipeline that enables Video-LLMs to reason over video content without external supervision. Our approach includes a self-critiquing mechanism that identifies reasoning errors in the model's initial responses and generates improved alternatives, creating preference pairs directly from video content. VideoSAVi then applies Direct Preference Optimization (DPO), which uses the preference data to iteratively train the model, enhancing temporal and spatial reasoning in video understanding. Experiments show that VideoSAVi achieves state-of-the-art performance on MVBench (74.0%) and delivers significant improvements on other benchmarks, including a 3.9% gain on PerceptionTest and a substantial 6.8% improvement on the challenging EgoSchema dataset compared to baseline models. Our model-agnostic approach is computationally efficient, requiring only 32 frames per video, and offers a promising direction for self-aligned video understanding without reliance on external models or annotations.
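The DPO step described above optimizes the model on self-generated preference pairs. As a rough illustration (not the paper's implementation), the standard DPO objective for one pair can be sketched in pure Python; the sequence-level log-probability interface and the `beta` value here are assumptions for the sake of the example:

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single preference pair.

    Inputs are sequence-level log p(response | video, prompt) under the
    current policy and a frozen reference model; "chosen" is the improved
    self-critiqued answer, "rejected" the flawed initial answer.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log sigmoid(margin): the loss shrinks as the policy assigns
    # relatively more probability to the chosen response.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With identical policy and reference log-probabilities the margin is zero and the loss equals `log 2`; raising the policy's log-probability of the chosen response lowers the loss, which is the behavior DPO training exploits.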