One-Step Diffusion for Detail-Rich and Temporally Consistent Video Super-Resolution

📅 2025-06-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing the challenge of simultaneously achieving photorealistic spatial detail and strong inter-frame temporal consistency in real-world video super-resolution (Real-VSR), this paper proposes Dual LoRA Learning (DLoRAL). Built upon Stable Diffusion, DLoRAL decouples learning into two specialized LoRA modules: Consistency-LoRA (C-LoRA) for modeling temporal coherence and Detail-LoRA (D-LoRA) for texture enhancement. A Cross-Frame Retrieval (CFR) module is introduced to aggregate complementary information across frames. Through a two-stage alternating optimization with module freezing, DLoRAL achieves high-fidelity detail preservation and robust temporal consistency within a single diffusion step. Extensive experiments demonstrate state-of-the-art performance across multiple Real-VSR benchmarks. Moreover, DLoRAL accelerates inference by 8.2× over iterative diffusion methods while enabling end-to-end, single-step high-quality reconstruction. The code and models are publicly available.

📝 Abstract
It is a challenging problem to reproduce rich spatial details while maintaining temporal consistency in real-world video super-resolution (Real-VSR), especially when we leverage pre-trained generative models such as Stable Diffusion (SD) for realistic detail synthesis. Existing SD-based Real-VSR methods often compromise spatial details for temporal coherence, resulting in suboptimal visual quality. We argue that the key lies in how to effectively extract degradation-robust temporal consistency priors from the low-quality (LQ) input video and enhance the video details while maintaining the extracted consistency priors. To achieve this, we propose a Dual LoRA Learning (DLoRAL) paradigm to train an effective SD-based one-step diffusion model, achieving realistic frame details and temporal consistency simultaneously. Specifically, we introduce a Cross-Frame Retrieval (CFR) module to aggregate complementary information across frames, and train a Consistency-LoRA (C-LoRA) to learn robust temporal representations from degraded inputs. After consistency learning, we fix the CFR and C-LoRA modules and train a Detail-LoRA (D-LoRA) to enhance spatial details while aligning with the temporal space defined by C-LoRA to keep temporal coherence. The two phases alternate iteratively for optimization, collaboratively delivering consistent and detail-rich outputs. During inference, the two LoRA branches are merged into the SD model, allowing efficient and high-quality video restoration in a single diffusion step. Experiments show that DLoRAL achieves strong performance in both accuracy and speed. Code and models are available at https://github.com/yjsunnn/DLoRAL.
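The abstract notes that, at inference, the two trained LoRA branches are merged into the SD backbone so restoration runs in a single diffusion step with no extra modules. A minimal sketch of that merge, assuming standard low-rank LoRA updates (the function and variable names here are illustrative, not from the paper's code):

```python
import numpy as np

def merge_dual_lora(W, lora_c, lora_d, alpha_c=1.0, alpha_d=1.0):
    """Fold two LoRA branches into a frozen base weight matrix.

    Each branch is a (down, up) pair contributing a low-rank update
    alpha * (up @ down), so after merging, inference uses only the
    single combined weight W'.
    """
    c_down, c_up = lora_c  # shapes: (r, in_dim), (out_dim, r)
    d_down, d_up = lora_d
    return W + alpha_c * (c_up @ c_down) + alpha_d * (d_up @ d_down)

# Tiny example: a 4x4 base weight with two rank-2 branches.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
lora_c = (rng.standard_normal((2, 4)), rng.standard_normal((4, 2)))
lora_d = (rng.standard_normal((2, 4)), rng.standard_normal((4, 2)))
W_merged = merge_dual_lora(W, lora_c, lora_d)
```

Because the merge is a one-time weight update, the merged model has exactly the same inference cost as the original SD backbone.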
Problem

Research questions and friction points this paper is trying to address.

Enhance video details while maintaining temporal consistency
Extract degradation-robust temporal priors from low-quality input
Achieve realistic details and coherence in one-step diffusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual LoRA Learning for one-step diffusion
Cross-Frame Retrieval for complementary information
Consistency-LoRA for robust temporal representations
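The two-stage alternating optimization described above can be sketched as a training loop that freezes one branch while the other updates. This is a schematic only, with hypothetical module handles and a caller-supplied `train_step`; it is not the paper's actual training code:

```python
def alternate_training(c_lora, d_lora, cfr, num_rounds, train_step):
    """Schematic of DLoRAL-style alternating optimization.

    Phase 1 trains CFR + C-LoRA for temporal consistency;
    Phase 2 freezes them and trains D-LoRA for spatial detail,
    aligned to the temporal space that C-LoRA defines.
    Modules are plain dicts with a 'trainable' flag for clarity.
    """
    for _ in range(num_rounds):
        # Phase 1: consistency learning (D-LoRA frozen).
        for m in (cfr, c_lora):
            m["trainable"] = True
        d_lora["trainable"] = False
        train_step(phase="consistency")

        # Phase 2: detail learning (CFR and C-LoRA frozen).
        for m in (cfr, c_lora):
            m["trainable"] = False
        d_lora["trainable"] = True
        train_step(phase="detail")
```

In a real implementation the `trainable` flags would correspond to toggling `requires_grad` on the relevant parameter groups between phases.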