🤖 AI Summary
This work addresses a limitation of existing video diffusion models in restoration tasks: the strong coupling between the conditional and unconditional branches of classifier-free guidance hinders effective use of generative priors. The authors propose a training-free framework built on Decoupled Time Guidance (DTG), which separates the conditional and unconditional signals along the diffusion-time axis. Specifically, the unconditional branch is evaluated at a cleaner diffusion timestep to provide a lookahead prior, and this temporal bias is annealed during sampling so that the model shifts from structure correction to detail refinement. Combined with any off-the-shelf restoration module in a plug-and-play manner, the method improves structural fidelity and temporal stability. The authors also introduce GenWarp480, a benchmark of 4,400 distorted 480p videos synthesized from text-to-video models, and report that the method corrects common generative artifacts, such as warped faces and body misalignments, in both AI-generated and real-world videos.
📝 Abstract
Recent progress in video diffusion models has enabled remarkable generative fidelity, yet leveraging these priors for restoration remains limited by the strong coupling between conditional and unconditional branches in standard classifier-free guidance. We introduce a training-free framework that enhances distorted and low-resolution videos by decoupling these signals in time. Our proposed Decoupled Time Guidance (DTG) evaluates the unconditional branch at a cleaner diffusion timestep, providing a lookahead prior that preserves geometry while suppressing replication of warped content. This temporal bias is annealed throughout sampling, allowing the model to transition from structure correction to detail refinement without retraining. Combined with any off-the-shelf restoration module in a plug-and-play manner, our approach improves perceptual coherence and restores plausible structure in AI-generated and real-world videos alike. To facilitate evaluation, we curate GenWarp480, a benchmark of 4,400 distorted 480p videos synthesized from diverse text-to-video models. GenWarp480 focuses on characteristic generative degradations such as warped faces, body misalignments, and spatial artifacts, providing a purpose-built testbed for assessing robustness to generative errors. Extensive experiments demonstrate that our method achieves significant improvements in structural fidelity and temporal stability without any model training.
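The abstract does not give the exact guidance formula, so the following is only a minimal sketch of the idea it describes: the usual classifier-free guidance combination, but with the unconditional branch queried at a cleaner (smaller) timestep, and with that timestep offset annealed to zero over sampling. The names `annealed_offset` and `decoupled_guidance`, the `denoiser(x, t, cond)` interface, the linear schedule, and the default values of `scale` and `max_offset` are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def annealed_offset(step, num_steps, max_offset):
    """Hypothetical linear anneal: full offset at the first step, zero at the last."""
    if num_steps <= 1:
        return 0
    frac = 1.0 - step / (num_steps - 1)
    return int(round(max_offset * frac))


def decoupled_guidance(denoiser, x, t, cond, step, num_steps,
                       scale=3.0, max_offset=100):
    """Guidance with the unconditional branch evaluated at a cleaner timestep.

    denoiser(x, t, cond) is an assumed interface returning a noise estimate;
    cond=None selects the unconditional branch.
    """
    offset = annealed_offset(step, num_steps, max_offset)
    # Smaller t means less noise here, so subtracting moves toward "cleaner".
    t_uncond = max(t - offset, 0)

    eps_cond = denoiser(x, t, cond)
    eps_uncond = denoiser(x, t_uncond, None)

    # Standard classifier-free guidance combination of the two estimates.
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

In this sketch, once the offset reaches zero both branches see the same timestep and the expression reduces to ordinary classifier-free guidance, which is one way to read the abstract's described transition from structure correction to detail refinement.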