OnlyFlow: Optical Flow Based Motion Conditioning for Video Diffusion Models

📅 2024-11-15
🏛️ 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
📈 Citations: 5
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of fine-grained motion control—particularly camera motion modeling and video editing—in text-to-video generation. We propose a lightweight, general-purpose, plug-and-play optical flow-guided diffusion framework. Our key innovation is the first direct integration of raw video optical flow as motion priors into text-to-video diffusion models, eliminating the need for manual annotations or task-specific training. The method leverages RAFT-based optical flow estimation, a trainable flow encoder, and a spatiotemporal U-Net backbone to jointly condition diffusion sampling on both optical flow features and text embeddings. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art methods across quantitative metrics (e.g., FVD, FID), visual quality, and human preference studies. Moreover, it exhibits strong generalization to unseen motion types—including translation and zoom-based editing—without retraining.

Technology Category

Application Category

📝 Abstract
We consider the problem of text-to-video generation tasks with precise control for various applications such as camera movement control and video-to-video editing. Most methods tackling this problem rely on providing userdefined controls, such as binary masks or camera movement embeddings. In our approach we propose OnlyFlow, an approach leveraging the optical flow firstly extracted from an input video to condition the motion of generated videos. Using a text prompt and an input video, OnlyFlow allows the user to generate videos that respect the motion of the input video as well as the text prompt. This is implemented through an optical flow estimation model applied on the input video, which is then fed to a trainable optical flow encoder. The output feature maps are then injected into the text-to-video backbone model. We perform quantitative, qualitative and user preference studies to show that OnlyFlow positively compares to state-of-the-art methods on a wide range of tasks, even though OnlyFlow was not specifically trained for such tasks. OnlyFlow thus constitutes a versatile, lightweight yet efficient method for controlling motion in text-to-video generation.
Problem

Research questions and friction points this paper is trying to address.

Generating videos with precise motion control using optical flow
Conditioning video generation on input video motion and text prompts
Providing versatile motion control without task-specific training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses optical flow from input video for motion control
Integrates trainable optical flow encoder into backbone model
Enables versatile motion conditioning without task-specific training
🔎 Similar Papers
No similar papers found.