Tuning-Free Long Video Generation via Global-Local Collaborative Diffusion

📅 2025-01-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address severe frame flickering, spatiotemporal distortion, and high computational cost in long-video generation, this paper proposes a fine-tuning-free global-local collaborative diffusion framework. Methodologically, we design a frequency-aware noise reinitialization strategy that integrates local shuffling with frequency-domain fusion, and introduce a motion-consistency refinement module that jointly optimizes pixel-level and frequency-domain gradients to unify the spatiotemporal denoising trajectories. Our core innovation lies in the first deep integration of frequency-domain modeling into both noise reinitialization and motion optimization, enabling a synergistic enhancement of content consistency and inter-frame coherence. Experiments demonstrate that our method significantly outperforms existing state-of-the-art approaches on both visual-fidelity and temporal-consistency metrics for videos extended to 3× and 6× their original length.
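As a rough illustration of the frequency-aware noise reinitialization described above (not the authors' released code), the sketch below shuffles noise frames within short temporal windows and then fuses the low temporal frequencies of the original noise with the high frequencies of the shuffled noise. The function name `noise_reinitialization`, the window size, and the low-frequency cutoff are all assumptions made for illustration.

```python
import torch

def noise_reinitialization(base_noise, window=16, low_freq_ratio=0.5):
    """Illustrative sketch (not the paper's implementation): shuffle noise frames
    inside local temporal windows, then keep the low temporal frequencies of the
    original noise and the high frequencies of the shuffled noise.
    Assumed tensor layout: (frames, channels, height, width)."""
    T = base_noise.shape[0]
    shuffled = base_noise.clone()

    # Local shuffling: permute frames only within each short temporal window.
    for start in range(0, T, window):
        end = min(start + window, T)
        perm = (torch.randperm(end - start) + start).to(base_noise.device)
        shuffled[start:end] = base_noise[perm]

    # Frequency-domain fusion along the temporal axis.
    base_f = torch.fft.fft(base_noise, dim=0)
    shuf_f = torch.fft.fft(shuffled, dim=0)
    cutoff = low_freq_ratio * 0.5  # fraction of the temporal Nyquist frequency (assumed)
    low = (torch.fft.fftfreq(T).abs() <= cutoff).float()
    low = low.view(T, 1, 1, 1).to(base_noise.device)
    fused = base_f * low + shuf_f * (1.0 - low)
    return torch.fft.ifft(fused, dim=0).real
```

In such a setup, the fused noise would replace the plain i.i.d. Gaussian initialization of each clip before the global and local denoising passes are run.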

📝 Abstract
Creating high-fidelity, coherent long videos is a sought-after aspiration. While recent video diffusion models have shown promising potential, they still grapple with spatiotemporal inconsistencies and high computational resource demands. We propose GLC-Diffusion, a tuning-free method for long video generation. It models the long video denoising process by establishing denoising trajectories through Global-Local Collaborative Denoising to ensure overall content consistency and temporal coherence between frames. Additionally, we introduce a Noise Reinitialization strategy which combines local noise shuffling with frequency fusion to improve global content consistency and visual diversity. Further, we propose a Video Motion Consistency Refinement (VMCR) module that computes the gradient of pixel-wise and frequency-wise losses to enhance visual consistency and temporal smoothness. Extensive experiments, including quantitative and qualitative evaluations on videos of varying lengths (e.g., 3× and 6× longer), demonstrate that our method effectively integrates with existing video diffusion models, producing coherent, high-fidelity long videos superior to previous approaches.
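To make the VMCR idea in the abstract concrete, here is a minimal sketch assuming a PyTorch setting in which predicted frames are compared against a reference trajectory with one pixel-space term and one frequency-space term. The function name `vmcr_loss`, the loss weights, and the use of an L1 loss on magnitude spectra are illustrative choices, not details confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def vmcr_loss(pred_frames, ref_frames, w_pixel=1.0, w_freq=0.5):
    """Illustrative VMCR-style objective (assumed form): a pixel-wise consistency
    term plus a frequency-domain consistency term on per-frame magnitude spectra."""
    # Pixel-wise consistency between the two denoising trajectories.
    pixel_loss = F.mse_loss(pred_frames, ref_frames)

    # Frequency-wise consistency on the 2D magnitude spectrum of each frame.
    pred_spec = torch.fft.rfft2(pred_frames).abs()
    ref_spec = torch.fft.rfft2(ref_frames).abs()
    freq_loss = F.l1_loss(pred_spec, ref_spec)

    return w_pixel * pixel_loss + w_freq * freq_loss

# The gradient of this loss with respect to the latents could then nudge the
# denoising trajectory between sampling steps, e.g.:
#   grad = torch.autograd.grad(vmcr_loss(pred, ref), latents)[0]
#   latents = latents - guidance_scale * grad
```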
Problem

Research questions and friction points this paper is trying to address.

Video Generation
Visual Coherence
Resource Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

GLC-Diffusion
VMCR module
Noise Reinitialization strategy
Authors

Yongjia Ma
LiAuto
Computer Vision, Neural Render, AIGC, VLA

Junlin Chen
Zhejiang University

Donglin Di
Li Auto Inc.
Generative Models, Embodied AI, Medical Image, Multimedia

Qi Xie
Xi'an Jiaotong University
Machine Learning, Computer Vision

Lei Fan
University of New South Wales

Wei Chen
Space AI, Li Auto

Xiaofei Gou
Space AI, Li Auto

Na Zhao
Singapore University of Technology and Design

Xun Yang
University of Science and Technology of China