PUSA V1.0: Surpassing Wan-I2V with $500 Training Cost by Vectorized Timestep Adaptation

📅 2025-07-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Video diffusion models suffer from rigid inter-frame temporal synchronization because conventional scheduling shares a single scalar timestep across all frames, while existing remedies trade off computational cost, generalizability, or task specificity. This paper proposes Vectorized Timestep Adaptation (VTA), a framework that replaces the fixed scalar timestep with a per-frame vector of timesteps—without modifying the backbone architecture—to enable fine-grained temporal dynamics modeling and zero-shot multi-task generation. VTA injects temporal priors non-intrusively, maintaining full compatibility with mainstream diffusion architectures. Trained on only 4K samples at a cost of ≤$500, VTA achieves a VBench-I2V total score of 87.32%—surpassing Wan-I2V—and supports diverse tasks including text-to-video synthesis, start-end-frame generation, and video extension.
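
The core idea is easiest to see in a sampling step. The following is a minimal, hypothetical sketch (not the released Pusa code) of one denoising update in which each frame carries its own timestep; `model` is a stand-in for a video denoiser such as the Wan2.1 backbone, and the flow-matching Euler update is an assumption about the sampler.

```python
import torch

def denoise_step_vta(model, latents, t_vec, dt_vec):
    """One Euler step with a *vector* of timesteps, one entry per frame.

    latents: (B, F, C, H, W) video latents
    t_vec:   (F,) per-frame timesteps (a scalar model would share one value)
    dt_vec:  (F,) per-frame step sizes; 0 freezes a frame entirely
    """
    velocity = model(latents, t_vec)  # denoiser conditioned per frame
    return latents - dt_vec.view(1, -1, 1, 1, 1) * velocity

# Toy stand-in for the denoiser, just to make the sketch runnable.
model = lambda x, t: torch.zeros_like(x)

B, F, C, H, W = 1, 8, 4, 16, 16
latents = torch.randn(B, F, C, H, W)
t_vec = torch.full((F,), 0.9)
dt_vec = torch.full((F,), 0.05)
t_vec[0], dt_vec[0] = 0.0, 0.0  # e.g., a clean conditioning frame stays fixed
latents = denoise_step_vta(model, latents, t_vec, dt_vec)
```

Because each frame has its own (t, dt) pair, frames can evolve asynchronously instead of being locked to one global noise level.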

📝 Abstract
The rapid advancement of video diffusion models has been hindered by fundamental limitations in temporal modeling, particularly the rigid synchronization of frame evolution imposed by conventional scalar timestep variables. While task-specific adaptations and autoregressive models have sought to address these challenges, they remain constrained by computational inefficiency, catastrophic forgetting, or narrow applicability. In this work, we present Pusa, a groundbreaking paradigm that leverages vectorized timestep adaptation (VTA) to enable fine-grained temporal control within a unified video diffusion framework. Moreover, VTA is a non-destructive adaptation, which means it fully preserves the capabilities of the base model. By fine-tuning the SOTA Wan2.1-T2V-14B model with VTA, we achieve unprecedented efficiency -- surpassing the performance of Wan-I2V-14B with ≤1/200 of the training cost ($500 vs. ≥$100,000) and ≤1/2500 of the dataset size (4K vs. ≥10M samples). Pusa not only sets a new standard for image-to-video (I2V) generation, achieving a VBench-I2V total score of 87.32% (vs. 86.86% for Wan-I2V-14B), but also unlocks many zero-shot multi-task capabilities such as start-end frames and video extension -- all without task-specific training. Meanwhile, Pusa can still perform text-to-video generation. Mechanistic analyses reveal that our approach preserves the foundation model's generative priors while surgically injecting temporal dynamics, avoiding the combinatorial explosion inherent to vectorized timesteps. This work establishes a scalable, efficient, and versatile paradigm for next-generation video synthesis, democratizing high-fidelity video generation for research and industry alike. Code is open-sourced at https://github.com/Yaofang-Liu/Pusa-VidGen
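
The "non-destructive" claim has a simple reading: if every entry of the timestep vector equals the original scalar, the conditioning the backbone sees is unchanged, so the base model's behavior is recovered. The sketch below illustrates this with a standard DDPM/DiT-style sinusoidal embedding; the exact embedding and injection used by Wan2.1 may differ, so treat this as an assumption-laden illustration rather than the paper's implementation.

```python
import math
import torch

def timestep_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Standard sinusoidal timestep embedding; t: (...,) -> (..., dim)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t[..., None].float() * freqs
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

# Base model: one scalar timestep, one embedding shared by all frames.
shared = timestep_embedding(torch.tensor(0.7), 256)         # (256,)

# VTA: one timestep per frame, embedded by the *same* function. When all
# entries are equal, each frame receives exactly the base model's embedding.
per_frame = timestep_embedding(torch.full((8,), 0.7), 256)  # (8, 256)
assert torch.allclose(per_frame[0], shared)
```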
Problem

Research questions and friction points this paper is trying to address.

Overcome rigid frame synchronization in video diffusion models
Address computational inefficiency in temporal modeling
Enable fine-grained control without task-specific training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vectorized timestep adaptation for fine-grained control (see the sketch after this list)
Non-destructive adaptation preserving base model capabilities
Efficient training with minimal cost and data
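
To make the zero-shot multi-task claim concrete, here is a hypothetical helper (not from the released code) showing how a single per-frame timestep vector can express the conditioning patterns the paper reports: frames that are given as conditions are pinned at t = 0, i.e., treated as already clean, while the rest follow the current noise level.

```python
import torch

def conditioning_timesteps(num_frames, t, cond_frames=(0,)):
    """Per-frame timestep vector with conditioning frames pinned at t = 0.

    cond_frames=(0,)              -> image-to-video (first frame given)
    cond_frames=(0, num_frames-1) -> start-end-frame generation
    cond_frames=range(k)          -> video extension from a k-frame prefix
    """
    t_vec = torch.full((num_frames,), t)
    t_vec[list(cond_frames)] = 0.0  # conditioning frames stay noise-free
    return t_vec

print(conditioning_timesteps(5, 0.8))                        # I2V
print(conditioning_timesteps(5, 0.8, cond_frames=(0, 4)))    # start-end frames
print(conditioning_timesteps(5, 0.8, cond_frames=range(2)))  # extend a prefix
```

Under this reading, switching tasks changes only the timestep vector, not the weights, which is consistent with the paper's claim that no task-specific training is needed.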
🔎 Similar Papers
No similar papers found.
👥 Authors
Yaofang Liu
City University of Hong Kong
Diffusion Models, Video Generation, Image Processing
Yumeng Ren
City University of Hong Kong
Aitor Artola
Postdoc at City University of Hong Kong
Machine Learning, Image Processing, Signal Processing
Yuxuan Hu
The Chinese University of Hong Kong
Xiaodong Cun
GVC Lab, Great Bay University
Computational Photography, Computer Vision, Computer Graphics
Xiaotong Zhao
AI Technology Center, Tencent PCG
Alan Zhao
AI Technology Center, Tencent PCG
Raymond H. Chan
Provost, Lingnan University, Hong Kong
Suiyun Zhang
Huawei Research
Rui Liu
Huawei Research
Dandan Tu
Huawei Research
Jean-Michel Morel
City University of Hong Kong