LiteVSR: Lightweight Adaptation of Frozen Diffusion Transformers for Video Super-Resolution

📅 2026-06-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high computational cost of adapting large-scale pretrained Diffusion Transformers to video super-resolution (VSR). The authors propose a lightweight dual-stream State-Aware Adapter grounded in the flow matching framework: because the predicted velocity field is constant across timesteps, adaptation reduces to learning a fixed injection pattern, which enables cross-domain VSR with the backbone fully frozen. The adapter uses time-dependent cross-attention to shift adaptively from structural alignment to texture refinement as denoising proceeds, and it remains compatible with fast sampling, down to a single step. The method attains competitive restoration quality with only 11.25% trainable parameters and 12 GPU-hours of training on a single A100.
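The constant-velocity property behind the fixed injection pattern can be checked numerically. The sketch below is not the authors' code. It assumes the common straight-line flow matching path x_t = (1 - t) x_0 + t x_1, with x_0 as noise and x_1 as the clean latent (the paper's own convention may differ), and it uses a hypothetical oracle velocity in place of a trained network. Under these assumptions the target velocity x_1 - x_0 is the same at every t, so a single Euler step already reaches the target.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t=0) to x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Time derivative of the straight path; it does not depend on t."""
    return x1 - x0

def euler_sample(x0, velocity_fn, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

rng = np.random.default_rng(0)
noise = rng.standard_normal((4, 8))  # stand-in for a noisy latent (assumption)
clean = rng.standard_normal((4, 8))  # stand-in for the HQ latent (assumption)

# Hypothetical oracle velocity; a trained model would only approximate it.
def oracle(x, t):
    return target_velocity(noise, clean)

one_step = euler_sample(noise, oracle, num_steps=1)
many_steps = euler_sample(noise, oracle, num_steps=50)
```

With an exact constant velocity, `one_step` and `many_steps` coincide. A trained model only approximates this velocity, so single-step quality in practice depends on how straight the learned trajectories are.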

📝 Abstract

Adapting large-scale pre-trained video generators for Video Super-Resolution (VSR) in novel domains remains computationally prohibitive. Methods that reformulate generation as direct Low-Quality (LQ) to High-Quality (HQ) mappings deviate from the original generative formulation, demanding extensive fine-tuning. ControlNet-style adapters lose their efficiency under modern Diffusion Transformers because the absence of an encoder-decoder hierarchy forces duplication of the entire backbone. We observe that flow matching offers a principled alternative for cross-domain VSR adaptation. Because the model predicts a constant velocity field across all timesteps, the adaptation task reduces to learning a fixed injection pattern rather than time-varying transformations. Building on this insight, we propose LiteVSR, a minimalist framework that performs VSR using a completely frozen Diffusion Transformer with a lightweight State-Aware Adapter. The adapter employs a dual-stream architecture that extracts static structural cues from the LQ input and dynamic cues from intermediate denoising states, aligning them through time-dependent cross-attention to enable an adaptive transition from structural alignment to texture refinement as denoising proceeds. LiteVSR achieves competitive restoration quality with only 11.25% trainable parameters and 12 GPU-hours of training on a single A100, while remaining compatible with fast sampling (down to a single step).
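To make the dual-stream idea concrete, here is a minimal NumPy sketch of one way such an adapter could be wired. Everything in it is an assumption for illustration: the class name `StateAwareAdapterSketch`, the choice to draw queries from the denoising state and keys/values from the LQ features, the sinusoidal timestep embedding, and the random untrained weights. The abstract does not specify these details, so this should not be read as the authors' architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def timestep_embedding(t, dim):
    """Sinusoidal embedding of a scalar timestep t in [0, 1] (assumed form)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * 1000.0 * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

class StateAwareAdapterSketch:
    """Hypothetical dual-stream adapter; weights are random, not trained."""

    def __init__(self, dim, rng):
        scale = 1.0 / np.sqrt(dim)
        self.dim = dim
        self.w_q = rng.standard_normal((dim, dim)) * scale
        self.w_k = rng.standard_normal((dim, dim)) * scale
        self.w_v = rng.standard_normal((dim, dim)) * scale
        self.w_t = rng.standard_normal((dim, dim)) * scale

    def __call__(self, state_tokens, lq_tokens, t):
        # Dynamic stream: queries from the intermediate denoising state,
        # shifted by a projected timestep embedding.
        t_emb = timestep_embedding(t, self.dim) @ self.w_t
        q = (state_tokens + t_emb) @ self.w_q
        # Static stream: keys and values from LQ features, independent of t.
        k = lq_tokens @ self.w_k
        v = lq_tokens @ self.w_v
        attn = softmax(q @ k.T / np.sqrt(self.dim), axis=-1)
        # Result would be injected into the frozen backbone (assumption).
        return attn @ v

rng = np.random.default_rng(0)
adapter = StateAwareAdapterSketch(dim=16, rng=rng)
lq = rng.standard_normal((10, 16))     # 10 LQ feature tokens (toy sizes)
state = rng.standard_normal((6, 16))   # 6 denoising-state tokens
out_a = adapter(state, lq, t=0.1)
out_b = adapter(state, lq, t=0.9)
```

Because the timestep only enters through the queries in this sketch, the same LQ keys and values are attended to differently at different timesteps, which is one simple way a static stream and a dynamic stream could interact.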

Problem

Research questions and friction points this paper is trying to address.

Video Super-Resolution

Diffusion Transformers

Model Adaptation

Cross-domain

Lightweight Adaptation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Flow Matching

Diffusion Transformer

Lightweight Adaptation