🤖 AI Summary
This work addresses the challenge of fine-grained motion control—particularly camera motion modeling and video editing—in text-to-video generation. We propose a lightweight, general-purpose, plug-and-play optical flow-guided diffusion framework. Our key innovation is the first direct integration of raw video optical flow as motion priors into text-to-video diffusion models, eliminating the need for manual annotations or task-specific training. The method leverages RAFT-based optical flow estimation, a trainable flow encoder, and a spatiotemporal U-Net backbone to jointly condition diffusion sampling on both optical flow features and text embeddings. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art methods across quantitative metrics (e.g., FVD, FID), visual quality, and human preference studies. Moreover, it exhibits strong generalization to unseen motion types—including translation and zoom-based editing—without retraining.
📝 Abstract
We consider the problem of text-to-video generation tasks with precise control for various applications such as camera movement control and video-to-video editing. Most methods tackling this problem rely on providing userdefined controls, such as binary masks or camera movement embeddings. In our approach we propose OnlyFlow, an approach leveraging the optical flow firstly extracted from an input video to condition the motion of generated videos. Using a text prompt and an input video, OnlyFlow allows the user to generate videos that respect the motion of the input video as well as the text prompt. This is implemented through an optical flow estimation model applied on the input video, which is then fed to a trainable optical flow encoder. The output feature maps are then injected into the text-to-video backbone model. We perform quantitative, qualitative and user preference studies to show that OnlyFlow positively compares to state-of-the-art methods on a wide range of tasks, even though OnlyFlow was not specifically trained for such tasks. OnlyFlow thus constitutes a versatile, lightweight yet efficient method for controlling motion in text-to-video generation.