LiteVSR: Lightweight Adaptation of Frozen Diffusion Transformers for Video Super-Resolution

📅 2026-06-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost of adapting large-scale pretrained Diffusion Transformers to video super-resolution (VSR) in new domains. The authors propose LiteVSR, which pairs a fully frozen backbone with a lightweight dual-stream State-Aware Adapter grounded in the flow matching framework. Because the model predicts a constant velocity field across all timesteps, adaptation reduces to learning a fixed injection pattern rather than time-varying transformations. The adapter uses time-dependent cross-attention to shift from structural alignment to texture refinement as denoising proceeds, and it remains compatible with fast sampling down to a single step. The method attains competitive restoration quality with only 11.25% trainable parameters and 12 GPU-hours of training on a single A100.
📝 Abstract
Adapting large-scale pre-trained video generators for Video Super-Resolution (VSR) in novel domains remains computationally prohibitive. Methods that reformulate generation as a direct Low-Quality (LQ) to High-Quality (HQ) mapping deviate from the original generative formulation and therefore demand extensive fine-tuning. ControlNet-style adapters lose their efficiency under modern Diffusion Transformers, since the absence of an encoder-decoder hierarchy forces duplication of the entire backbone. We observe that flow matching offers a principled alternative for cross-domain VSR adaptation. By predicting a constant velocity field across all timesteps, the adaptation task reduces to learning a fixed injection pattern rather than time-varying transformations. Building on this insight, we propose LiteVSR, a minimalist framework that performs VSR using a completely frozen Diffusion Transformer with a lightweight State-Aware Adapter. The adapter employs a dual-stream architecture that extracts static structural cues from the LQ input and dynamic cues from intermediate denoising states, aligning them through time-dependent cross-attention to enable an adaptive transition from structural alignment to texture refinement as denoising proceeds. LiteVSR achieves competitive restoration quality with only 11.25% trainable parameters and 12 GPU-hours of training on a single A100, while remaining compatible with fast sampling, down to a single step.
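The constant-velocity observation in the abstract can be illustrated with a toy example. The sketch below assumes the common straight-line (rectified-flow-style) path x_t = (1 - t) * x0 + t * x1, with noise at t = 0 and the target at t = 1. The listing does not state the paper's exact path or time convention, so both are assumptions, and the names `interpolate`, `velocity`, `euler_sample`, and `oracle` are hypothetical. It is not the authors' implementation; it only shows why a velocity that is constant in t allows sampling in a single Euler step.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path from x0 (noise) to x1 (target) at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def velocity(x0, x1):
    """Time derivative of the straight path: x1 - x0, independent of t."""
    return [b - a for a, b in zip(x0, x1)]

def euler_sample(x0, v_fn, num_steps):
    """Integrate dx/dt = v_fn(x, t) from t = 0 to t = 1 with uniform Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = v_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

rng = random.Random(0)
x0 = [rng.gauss(0.0, 1.0) for _ in range(4)]  # toy "noise" sample (assumption)
x1 = [0.5, -1.0, 2.0, 0.0]                    # toy "high-quality" target (assumption)

v_const = velocity(x0, x1)

def oracle(x, t):
    """Stand-in for a perfectly trained velocity predictor (hypothetical)."""
    return v_const

one_step = euler_sample(x0, oracle, 1)    # single-step sampling
ten_steps = euler_sample(x0, oracle, 10)  # multi-step sampling, same endpoint
```

With an ideal constant velocity, one step and ten steps reach the same endpoint up to floating-point error. A learned predictor is only approximately constant in practice, which is why additional steps can still help.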
Problem

Research questions and friction points this paper is trying to address.

Video Super-Resolution
Diffusion Transformers
Model Adaptation
Cross-domain
Lightweight Adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Flow Matching
Diffusion Transformer
Lightweight Adaptation
State-Aware Adapter
Video Super-Resolution