OnlyFlow: Optical Flow Based Motion Conditioning for Video Diffusion Models

📅 2024-11-15

🏛️ 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

📈 Citations: 5

✨ Influential: 0

career value

194K/year

🤖 AI Summary

This work addresses the challenge of fine-grained motion control—particularly camera motion modeling and video editing—in text-to-video generation. We propose a lightweight, general-purpose, plug-and-play optical flow-guided diffusion framework. Our key innovation is the first direct integration of raw video optical flow as motion priors into text-to-video diffusion models, eliminating the need for manual annotations or task-specific training. The method leverages RAFT-based optical flow estimation, a trainable flow encoder, and a spatiotemporal U-Net backbone to jointly condition diffusion sampling on both optical flow features and text embeddings. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art methods across quantitative metrics (e.g., FVD, FID), visual quality, and human preference studies. Moreover, it exhibits strong generalization to unseen motion types—including translation and zoom-based editing—without retraining.

Technology Category

Application Category

📝 Abstract

We consider the problem of text-to-video generation tasks with precise control for various applications such as camera movement control and video-to-video editing. Most methods tackling this problem rely on providing userdefined controls, such as binary masks or camera movement embeddings. In our approach we propose OnlyFlow, an approach leveraging the optical flow firstly extracted from an input video to condition the motion of generated videos. Using a text prompt and an input video, OnlyFlow allows the user to generate videos that respect the motion of the input video as well as the text prompt. This is implemented through an optical flow estimation model applied on the input video, which is then fed to a trainable optical flow encoder. The output feature maps are then injected into the text-to-video backbone model. We perform quantitative, qualitative and user preference studies to show that OnlyFlow positively compares to state-of-the-art methods on a wide range of tasks, even though OnlyFlow was not specifically trained for such tasks. OnlyFlow thus constitutes a versatile, lightweight yet efficient method for controlling motion in text-to-video generation.

Problem

Research questions and friction points this paper is trying to address.

Generating videos with precise motion control using optical flow

Conditioning video generation on input video motion and text prompts

Providing versatile motion control without task-specific training

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses optical flow from input video for motion control

Integrates trainable optical flow encoder into backbone model

Enables versatile motion conditioning without task-specific training

🔎 Similar Papers

No similar papers found.

Apple

Cupertino, United States of America

AI Research Scientist, Computer Vision - Facebook Video Intelligence