🤖 AI Summary
Diffusion models for video super-resolution (VSR) suffer from severe artifacts and temporal inconsistency due to their inherent stochasticity. To address this without paired data, we propose a self-supervised, Mamba-enhanced framework. Our method introduces: (1) the first self-supervised ControlNet-guided mechanism for degradation-agnostic feature disentanglement; (2) a Video State-Space Module driven by 3D Selective Scan to model long-range spatio-temporal dependencies; and (3) a three-stage training strategy on a mixture of high-resolution and low-resolution videos that jointly optimizes latent diffusion priors. On real-world VSR benchmarks, our approach outperforms state-of-the-art methods, with gains in PSNR (+1.27 dB) and SSIM (+0.021), while generating videos with better perceptual quality, stronger inter-frame coherence, and markedly fewer artifacts.
📝 Abstract
Existing diffusion-based video super-resolution (VSR) methods are prone to introducing complex degradations and noticeable artifacts into high-resolution videos because of their inherent randomness. In this paper, we propose a noise-robust real-world VSR framework that incorporates self-supervised learning and Mamba into pre-trained latent diffusion models. To ensure content consistency across adjacent frames, we enhance the diffusion model with a global spatio-temporal attention mechanism built on a Video State-Space block with a 3D Selective Scan module, which reinforces coherence at an affordable computational cost. To further reduce artifacts in generated details, we introduce a self-supervised ControlNet that uses HR features as guidance and employs contrastive learning to extract degradation-insensitive features from LR videos. Finally, we propose a three-stage training strategy based on a mixture of HR-LR videos to stabilize VSR training. The proposed Self-supervised ControlNet with Spatio-Temporal Continuous Mamba based VSR algorithm achieves better perceptual quality than state-of-the-art methods on real-world VSR benchmark datasets, validating the effectiveness of the proposed model design and training strategies.
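The contrastive step mentioned above can be illustrated with a minimal sketch. The abstract does not state which contrastive objective is used, so the InfoNCE-style loss below is an assumption, and the names `cosine`, `info_nce`, and `temperature` are hypothetical. Each LR feature is pulled toward the HR feature of the same clip (the positive) and pushed away from the HR features of other clips (the negatives).

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def info_nce(lr_feats, hr_feats, temperature=0.1):
    """InfoNCE-style loss (an assumed objective, not confirmed by the source).

    lr_feats[i] and hr_feats[i] are features of the same clip (positive pair);
    hr_feats[j] with j != i act as negatives for lr_feats[i].
    """
    n = len(lr_feats)
    total = 0.0
    for i in range(n):
        logits = [cosine(lr_feats[i], hr_feats[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the matching HR feature.
        total += log_denom - logits[i]
    return total / n

# Toy features: three clips, three-dimensional embeddings.
hr = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
lr_aligned = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
lr_shuffled = [lr_aligned[1], lr_aligned[2], lr_aligned[0]]

loss_aligned = info_nce(lr_aligned, hr)
loss_shuffled = info_nce(lr_shuffled, hr)
```

In this toy setup the loss is lower when LR features line up with the HR features of their own clip than when the pairing is shuffled, which is the behaviour a degradation-insensitive encoder would be trained toward.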