🤖 AI Summary
This work proposes LocalDPO, a novel alignment framework for video diffusion models that overcomes the inefficiency and coarse supervision of existing methods relying on multi-sample ranking and external evaluators. LocalDPO introduces an automatic preference pair generation mechanism based on local spatiotemporal masking: negative samples are created by locally corrupting real videos, and a frozen base model reconstructs the masked regions to form fine-grained, region-level preference pairs. Coupled with a region-aware DPO loss, this approach enables efficient alignment in a single inference pass, without requiring human annotations or external models. Experiments on Wan2.1 and CogVideoX demonstrate that LocalDPO significantly improves video fidelity, temporal consistency, and human preference scores, outperforming current post-training alignment techniques.
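The local corruption step above can be illustrated as sampling a random spatio-temporal cuboid to mask. This is a minimal sketch under assumed conventions; the function name, mask shape (a single axis-aligned cuboid), and fraction parameters are illustrative, not specified by the paper:

```python
import numpy as np

def random_st_mask(T, H, W, t_frac=0.5, s_frac=0.25, rng=None):
    """Return a binary (T, H, W) mask marking a random spatio-temporal
    cuboid to corrupt (1 = corrupted region, 0 = kept as-is)."""
    if rng is None:
        rng = np.random.default_rng()
    # cuboid extents as fractions of the video's temporal/spatial size
    t_len = max(1, int(T * t_frac))
    h_len = max(1, int(H * s_frac))
    w_len = max(1, int(W * s_frac))
    # random top-left-front corner so the cuboid fits inside the video
    t0 = rng.integers(0, T - t_len + 1)
    h0 = rng.integers(0, H - h_len + 1)
    w0 = rng.integers(0, W - w_len + 1)
    mask = np.zeros((T, H, W), dtype=np.float32)
    mask[t0:t0 + t_len, h0:h0 + h_len, w0:w0 + w_len] = 1.0
    return mask
```

The negative sample would then be produced by noising only the masked region of a real video and letting the frozen base model restore it, so that positive and negative differ only inside the mask.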
📄 Abstract
Aligning text-to-video diffusion models with human preferences is crucial for generating high-quality videos. Existing Direct Preference Optimization (DPO) methods rely on multi-sample ranking and task-specific critic models, which are inefficient and often yield ambiguous global supervision. To address these limitations, we propose LocalDPO, a novel post-training framework that constructs localized preference pairs from real videos and optimizes alignment at the spatio-temporal region level. We design an automated pipeline that generates preference pairs with a single inference pass per prompt, eliminating the need for external critic models or manual annotation. Specifically, we treat high-quality real videos as positive samples and generate corresponding negatives by locally corrupting them with random spatio-temporal masks and restoring only the masked regions using the frozen base model. During training, we introduce a region-aware DPO loss that restricts preference learning to corrupted areas for rapid convergence. Experiments on Wan2.1 and CogVideoX demonstrate that LocalDPO consistently improves video fidelity, temporal coherence, and human preference scores over other post-training approaches, establishing a more efficient and fine-grained paradigm for video generator alignment.
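The region-aware DPO loss can be sketched as a Diffusion-DPO-style objective in which the per-sample denoising error is averaged only over the corrupted region. This is a minimal NumPy illustration under stated assumptions: an epsilon-prediction objective, a hypothetical `beta` scale, and argument names invented for clarity; the paper's exact formulation may differ:

```python
import numpy as np

def region_aware_dpo_loss(pred_win, pred_lose, noise, mask,
                          ref_pred_win, ref_pred_lose, beta=500.0):
    """DPO-style loss where the preference signal comes only from the
    masked (corrupted) region. All inputs share the noise-prediction
    tensor shape; mask is binary with 1 marking corrupted elements."""
    def masked_mse(pred):
        # mean squared denoising error restricted to the masked region
        err = (pred - noise) ** 2
        return (err * mask).sum() / max(mask.sum(), 1.0)

    # implicit reward gap: how much better the model denoises the real
    # (winning) video than the locally corrupted (losing) one, measured
    # relative to the frozen reference model
    model_diff = masked_mse(pred_win) - masked_mse(pred_lose)
    ref_diff = masked_mse(ref_pred_win) - masked_mse(ref_pred_lose)
    logits = -beta * (model_diff - ref_diff)
    # -log(sigmoid(logits)), computed stably via logaddexp
    return float(np.logaddexp(0.0, -logits))
```

Because unmasked elements contribute nothing to `masked_mse`, gradients flow only through the corrupted region, which is the mechanism the abstract credits for rapid convergence.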