🤖 AI Summary
Video diffusion models (VDMs) often suffer from motion inconsistency due to insufficient temporal modeling. To address this, we propose FlowLoss, an explicit optical-flow matching loss that directly compares RAFT-estimated flow fields between generated and ground-truth videos, departing from conventional warping-based implicit flow guidance. We further introduce a noise-aware weighting mechanism that modulates the strength of the flow supervision according to the noise level at each denoising step, since flow estimates on heavily noised frames are unreliable. Our method requires no auxiliary networks or post-processing modules. Evaluated on robotic video datasets, it improves temporal motion consistency and accelerates convergence in early training. The core contribution is the integration of explicit optical-flow matching with noise-dependent weighting into a unified loss function, offering a practical route to motion-based supervision in VDMs.
📝 Abstract
Video Diffusion Models (VDMs) can generate high-quality videos, but often struggle to produce temporally coherent motion. Optical flow supervision is a promising approach to address this, with prior works commonly employing warping-based strategies that avoid explicit flow matching. In this work, we explore an alternative formulation, FlowLoss, which directly compares flow fields extracted from generated and ground-truth videos. To account for the unreliability of flow estimation at high noise levels in the diffusion process, we propose a noise-aware weighting scheme that modulates the flow loss across denoising steps. Experiments on robotic video datasets suggest that FlowLoss improves motion stability and accelerates convergence in early training stages. Our findings offer practical insights for incorporating motion-based supervision into noise-conditioned generative models.
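The two ingredients described above (an explicit flow-matching term and a noise-dependent weight on it) can be sketched in a few lines. The snippet below is a minimal NumPy illustration, not the paper's implementation: the flow fields are taken as given inputs (in the paper they would come from running RAFT on the generated and ground-truth videos), the endpoint-error loss and the linear weight schedule `noise_aware_weight` are plausible placeholder choices, and all function names are hypothetical.

```python
import numpy as np

def flow_matching_loss(flow_gen, flow_gt):
    """Explicit flow matching: mean endpoint error between two flow fields.

    flow_gen, flow_gt: arrays of shape (T-1, H, W, 2), one 2-D motion
    vector per pixel for each consecutive frame pair. In the paper these
    would be RAFT estimates on generated vs. ground-truth videos.
    """
    return float(np.mean(np.linalg.norm(flow_gen - flow_gt, axis=-1)))

def noise_aware_weight(t, T, w_max=1.0):
    """Hypothetical weight schedule: attenuate flow supervision at high noise.

    t: current diffusion noise step (0 = clean, T = pure noise). A linear
    ramp is only one plausible choice; the paper's exact schedule is not
    reproduced here.
    """
    return w_max * (1.0 - t / T)

def flowloss(flow_gen, flow_gt, t, T):
    # Noise-weighted flow term, to be added to the diffusion training loss.
    return noise_aware_weight(t, T) * flow_matching_loss(flow_gen, flow_gt)
```

With this shape, identical flow fields yield zero loss, and at the highest noise step (`t == T`) the weight vanishes, so steps where flow estimation is unreliable contribute nothing to the gradient.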