AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion

📅 2025-03-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing video generation methods face a fundamental trade-off between temporal consistency and length flexibility: asynchronous auto-regressive models suffer from error accumulation caused by training-inference inconsistency, while synchronous diffusion models are constrained by fixed-length sequence dependencies. This work proposes Auto-Regressive Diffusion (AR-Diffusion), which applies the same gradual diffusion corruption to video frames in both training and inference, closing the gap between the two phases. AR-Diffusion enforces a non-decreasing constraint on per-frame corruption timesteps, so earlier frames always remain clearer than later ones, and uses temporal causal attention to maintain long-range coherence. It further introduces two dynamic timestep schedulers: FoPP for balanced timestep sampling during training and AD for flexible timestep differences during inference, supporting synchronous or asynchronous generation of fixed- or variable-length videos. Evaluated on four challenging benchmarks, AR-Diffusion achieves competitive and state-of-the-art results, significantly improving temporal coherence and generation robustness, especially for long videos.

📝 Abstract
The task of video generation requires synthesizing visually realistic and temporally coherent video frames. Existing methods primarily use asynchronous auto-regressive models or synchronous diffusion models to address this challenge. However, asynchronous auto-regressive models often suffer from inconsistencies between training and inference, leading to issues such as error accumulation, while synchronous diffusion models are limited by their reliance on rigid sequence length. To address these issues, we introduce Auto-Regressive Diffusion (AR-Diffusion), a novel model that combines the strengths of auto-regressive and diffusion models for flexible, asynchronous video generation. Specifically, our approach leverages diffusion to gradually corrupt video frames in both training and inference, reducing the discrepancy between these phases. Inspired by auto-regressive generation, we incorporate a non-decreasing constraint on the corruption timesteps of individual frames, ensuring that earlier frames remain clearer than subsequent ones. This setup, together with temporal causal attention, enables flexible generation of videos with varying lengths while preserving temporal coherence. In addition, we design two specialized timestep schedulers: the FoPP scheduler for balanced timestep sampling during training, and the AD scheduler for flexible timestep differences during inference, supporting both synchronous and asynchronous generation. Extensive experiments demonstrate the superiority of our proposed method, which achieves competitive and state-of-the-art results across four challenging benchmarks.
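
The two mechanisms the abstract describes, a non-decreasing per-frame corruption timestep and temporal causal attention, can be illustrated with a short sketch. This is a minimal, hypothetical PyTorch rendering under a standard DDPM-style forward process, not the authors' code; the function names, the noise schedule, and the dummy shapes are all assumptions.

```python
import torch

def sample_nondecreasing_timesteps(num_frames: int, max_t: int) -> torch.Tensor:
    """Sample one diffusion timestep per frame with t_1 <= t_2 <= ... <= t_F,
    so an earlier frame is always at least as clean as any later frame."""
    t = torch.randint(0, max_t, (num_frames,))
    return torch.sort(t).values  # sorting enforces the non-decreasing constraint

def temporal_causal_mask(num_frames: int) -> torch.Tensor:
    """Boolean attention mask where frame i may only attend to frames j <= i."""
    return torch.ones(num_frames, num_frames).tril().bool()

# Example: corrupt each frame to its own timestep (DDPM-style forward process).
F, T = 8, 1000
alphas_cumprod = torch.linspace(0.9999, 0.01, T)    # stand-in noise schedule
t_per_frame = sample_nondecreasing_timesteps(F, T)  # sorted, one t per frame
x0 = torch.randn(F, 3, 64, 64)                      # clean frames (dummy data)
a = alphas_cumprod[t_per_frame].view(F, 1, 1, 1)
xt = a.sqrt() * x0 + (1 - a).sqrt() * torch.randn_like(x0)  # per-frame corruption
mask = temporal_causal_mask(F)  # would gate attention inside the denoiser
```

Because every frame carries its own timestep, training and inference can corrupt frames the same way, which is the discrepancy-reduction argument made above.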
Problem

Research questions and friction points this paper is trying to address.

Asynchronous auto-regressive models suffer from training-inference inconsistency, leading to error accumulation.
Synchronous diffusion models are constrained by rigid, fixed-length sequence dependencies.
Generating variable-length videos while preserving temporal coherence remains an open challenge.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines auto-regressive generation with diffusion corruption in a single framework
Uses temporal causal attention so earlier frames condition later ones
Introduces specialized timestep schedulers: FoPP for training, AD for inference (see the sketch after this list)
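
The paper specifies only what the schedulers do: FoPP balances timestep sampling at training time, and AD controls the timestep difference between frames at inference. As a rough illustration of the inference side, the sketch below holds adjacent frames a fixed gap `d` apart while denoising, so `d = 0` recovers synchronous generation and `d > 0` lets earlier frames finish first. `ad_style_schedule` and `denoise_step` are hypothetical names, not the paper's API.

```python
import torch

def ad_style_schedule(num_frames: int, max_t: int, d: int):
    """Yield one timestep vector per denoising step. Frame i trails frame i-1
    by d steps, so d = 0 is synchronous and larger d is more asynchronous."""
    total_steps = max_t + (num_frames - 1) * d
    for step in range(total_steps + 1):
        t = torch.clamp(torch.arange(num_frames) * d - step + max_t,
                        min=0, max=max_t)
        yield t
        if int(t.max()) == 0:  # every frame fully denoised
            break

# Usage: walk 4 frames from t = 50 down to 0 with a gap of 10 between frames.
frames = torch.randn(4, 3, 64, 64)  # noisy latents (dummy data)
for t in ad_style_schedule(num_frames=4, max_t=50, d=10):
    # a real loop would call the model here, e.g.:
    # frames = denoise_step(frames, t)  # hypothetical model update
    pass
```

With a schedule of this shape, the per-frame timesteps stay non-decreasing along the temporal axis at every step, matching the constraint described in the abstract.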
Mingzhen Sun
IA, CAS, UCAS
Weining Wang
IA, CAS, UCAS
Gen Li
Bytedance Inc.
Jiawei Liu
Bytedance Inc.
Jiahui Sun
Shanghai Jiao Tong University
Wanquan Feng
USTC
Shanshan Lao
Bytedance Inc.
Siyu Zhou
Bytedance Inc.
Qian He
ByteDance
Jing Liu
IA, CAS, UCAS