Physics in 2-Steps: Locking Motion Priors Before Visual Refinement Erases Them

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses physically implausible motion in image-to-video diffusion models. The study traces the problem to the degradation of phase information over many denoising steps and introduces PhaseLock, a training-free framework that preserves temporal coherence through a two-stage generation strategy. A motion prior is first extracted from a 2-step inference pass and then enforced throughout high-fidelity generation via Latent Delta Guidance. Evaluated across multiple models, PhaseLock improves physical consistency by an average of 6.2 points at 1.06× inference time and 1.02× memory, while largely maintaining visual quality and reducing reliance on expensive external guidance methods, which cost roughly 5× inference time.

📝 Abstract

Image-to-Video diffusion models leverage input images to generate visually stunning content, yet frequently produce motion that violates physical laws. We reveal a surprising finding: a 2-step generation often exhibits better physical consistency than a 50-step output from the same model. Through spectral analysis, we trace this to phase erosion during denoising; the phase degrades significantly (dropping by $\approx 18\%$ from step 2 to step 50), whereas the magnitude remains relatively stable. Building on this insight, we propose PhaseLock, a training-free framework that preserves the valid motion priors from few-step inference throughout the denoising trajectory. Rather than relying on full-step inference for physical consistency, PhaseLock extracts a motion prior from just 2 steps and enforces it onto high-fidelity generation via Latent Delta Guidance. Our approach effectively mitigates phase degradation, improving physical consistency by an average of 6.2 points across diverse models while largely maintaining visual fidelity, with negligible overhead ($1.06\times$ time, $1.02\times$ memory) and reduced reliance on expensive external guidance methods ($\sim5\times$ time).
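The distinction between phase and magnitude in the abstract can be illustrated with a small spectral comparison. The sketch below is not the paper's metric, which this summary does not specify. It assumes a video stored as a `(T, H, W)` NumPy array, and the helper names `make_moving_blob` and `spectral_similarity` are hypothetical. It shows that shifting a motion in time leaves the Fourier magnitude untouched while changing the phase, which is why a phase-based score can pick up motion changes that a magnitude-based score misses.

```python
import numpy as np


def make_moving_blob(frames=8, size=16, sigma=2.0):
    """Toy video (T, H, W): a Gaussian blob moving one pixel right per frame."""
    ys, xs = np.mgrid[0:size, 0:size]
    video = np.empty((frames, size, size))
    for t in range(frames):
        cx, cy = 4 + t, size // 2
        video[t] = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma**2))
    return video


def spectral_similarity(ref, test, eps=1e-8):
    """Return (magnitude_similarity, phase_similarity) of two equal-shape videos.

    Magnitude similarity: cosine similarity of the two magnitude spectra.
    Phase similarity: mean cosine of the phase difference, weighted by the
    reference magnitude so near-empty frequency bins do not dominate.
    """
    f_ref = np.fft.fftn(ref)
    f_test = np.fft.fftn(test)
    mag_ref, mag_test = np.abs(f_ref), np.abs(f_test)

    mag_sim = np.sum(mag_ref * mag_test) / (
        np.linalg.norm(mag_ref) * np.linalg.norm(mag_test) + eps
    )
    phase_diff = np.angle(f_ref) - np.angle(f_test)
    phase_sim = np.sum(mag_ref * np.cos(phase_diff)) / (np.sum(mag_ref) + eps)
    return float(mag_sim), float(phase_sim)


video = make_moving_blob()
# Circularly shift the motion by two frames: same content, different timing.
shifted = np.roll(video, 2, axis=0)

mag_sim, phase_sim = spectral_similarity(video, shifted)
print(f"magnitude similarity: {mag_sim:.4f}")
print(f"phase similarity:     {phase_sim:.4f}")
```

In this toy case the magnitude similarity stays at essentially 1 while the phase similarity drops well below it. The same pair of scores could in principle be computed between a 2-step and a 50-step output of one model, although reproducing the reported 18% figure would require the authors' exact definition.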

Problem

Research questions and friction points this paper is trying to address.

physical consistency

image-to-video diffusion

motion priors

phase erosion

denoising

Innovation

Methods, ideas, or system contributions that make the work stand out.

PhaseLock

motion prior

phase preservation