Real-Time Motion-Controllable Autoregressive Video Diffusion

📅 2025-10-09
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Real-time controllable video generation faces challenges including high latency in bidirectional diffusion models, weak control capability in autoregressive methods, and poor visual quality in few-step generation. This paper proposes the first few-step autoregressive video diffusion framework integrated with reinforcement learning. It introduces a novel Self-Rollout mechanism to ensure Markovian dynamics and designs a trajectory-aware reward model for fine-grained motion control. A selective denoising strategy is further incorporated to significantly accelerate both training and inference. With only 1.3 billion parameters, the model achieves high visual fidelity while drastically reducing generation latency. It outperforms existing state-of-the-art methods in motion alignment accuracy and control flexibility, enabling real-time, high-fidelity, and precisely controllable video synthesis.
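The trajectory-aware reward described above can be illustrated with a minimal sketch. The paper does not specify the exact reward form here, so this is only an assumption: a reward that scores how closely a tracked point trajectory in the generated video follows the user-given target path. The function name and interface are hypothetical.

```python
import numpy as np

def trajectory_reward(pred_traj, target_traj):
    # Hypothetical sketch of a trajectory-based reward: the exact
    # formulation used by the paper is not given in this summary.
    # Reward motion by how tightly the predicted point trajectory
    # follows the target trajectory (per-frame (x, y) positions).
    pred = np.asarray(pred_traj, dtype=float)
    target = np.asarray(target_traj, dtype=float)
    dists = np.linalg.norm(pred - target, axis=-1)  # per-frame point error
    return -float(dists.mean())                     # higher reward = tighter alignment
```

A perfectly aligned trajectory yields the maximum reward of 0; larger deviations yield more negative rewards, giving the RL stage a dense, fine-grained motion-control signal.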

📝 Abstract
Real-time motion-controllable video generation remains challenging due to the inherent latency of bidirectional diffusion models and the lack of effective autoregressive (AR) approaches. Existing AR video diffusion models are limited to simple control signals or text-to-video generation, and often suffer from quality degradation and motion artifacts in few-step generation. To address these challenges, we propose AR-Drag, the first RL-enhanced few-step AR video diffusion model for real-time image-to-video generation with diverse motion control. We first fine-tune a base I2V model to support basic motion control, then further improve it via reinforcement learning with a trajectory-based reward model. Our design preserves the Markov property through a Self-Rollout mechanism and accelerates training by selectively introducing stochasticity in denoising steps. Extensive experiments demonstrate that AR-Drag achieves high visual fidelity and precise motion alignment, significantly reducing latency compared with state-of-the-art motion-controllable VDMs, while using only 1.3B parameters. Additional visualizations can be found on our project page: https://kesenzhao.github.io/AR-Drag.github.io/.
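The Self-Rollout idea in the abstract can be sketched as follows. This is a minimal illustration under assumptions (chunk-wise autoregressive denoiser, hypothetical function names): each chunk is denoised conditioned only on the model's own previously generated chunk, so the generation chain stays Markovian and training rollouts match inference-time dynamics.

```python
import numpy as np

def self_rollout(model, first_frame, num_chunks, num_steps, rng=None):
    # Minimal sketch (all names hypothetical): chunk t is generated
    # conditioned only on the model's own output for chunk t-1,
    # never on ground-truth future frames, preserving the Markov
    # property of the autoregressive chain.
    rng = rng or np.random.default_rng(0)
    context = first_frame                        # conditioning for the first chunk
    chunks = [first_frame]
    for _ in range(num_chunks):
        x = rng.standard_normal(context.shape)   # start each chunk from noise
        for step in range(num_steps):            # few-step denoising loop
            x = model(x, context, step)          # one denoising update
        context = x                              # roll out: own output becomes context
        chunks.append(x)
    return np.concatenate(chunks, axis=0)
```

Because the rollout consumes the model's own outputs rather than ground-truth frames, the distribution seen during RL training matches the distribution seen at inference.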
Problem

Research questions and friction points this paper is trying to address.

Real-time motion-controllable video generation with low latency
Overcoming quality degradation in autoregressive video diffusion models
Enhancing motion alignment and fidelity with reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning enhanced autoregressive video diffusion model
Self-Rollout mechanism preserves Markov property for training
Selective stochasticity in denoising steps accelerates training process
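The last bullet, selective stochasticity in denoising, can be sketched as a sampler that runs most steps deterministically and injects noise only at a chosen subset of steps. This is a hedged illustration, not the paper's implementation; the interface and noise schedule are assumptions.

```python
import numpy as np

def selective_denoise(model, x, sigmas, stochastic_steps, rng=None):
    # Hypothetical sketch: deterministic denoising updates at every
    # step, with noise injected only at the steps listed in
    # `stochastic_steps`. Restricting stochasticity to a few steps
    # keeps trajectory sampling cheap for RL training while still
    # allowing exploration.
    rng = rng or np.random.default_rng(0)
    for i in range(len(sigmas) - 1):
        x = model(x, sigmas[i])                  # deterministic denoising update
        if i in stochastic_steps:                # inject noise only at selected steps
            x = x + sigmas[i + 1] * rng.standard_normal(x.shape)
    return x
```

With `stochastic_steps` empty the sampler is fully deterministic; adding steps trades determinism for exploration, which is the lever the RL stage would exploit.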
Authors
Kesen Zhao (Nanyang Technological University)
Jiaxin Shi (Xmax.AI Ltd)
Beier Zhu (Nanyang Technological University)
Junbao Zhou
Xiaolong Shen (Singapore Management University)
Yuan Zhou (Nanyang Technological University)
Qianru Sun (Zhejiang University)
Hanwang Zhang (Nanyang Technological University)