SwiftVR: Real-Time One-Step Generative Video Restoration

📅 2026-06-08
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work targets the per-frame latency and GPU memory bottlenecks of real-time, high-resolution video restoration for live streaming. It proposes SwiftVR, a streaming one-step generative video restoration framework that operates under a causal chunk-wise protocol. SwiftVR combines mask-free shifted-window self-attention with a lightweight Restoration-aware Autoencoder, so inference runs entirely on standard dense attention, without masks, cyclic shifts, padding, or custom sparse kernels. It reaches 26 FPS at 1080p on a consumer RTX 5090 and, on a single H100, 31 FPS at 2560×1440 and 14 FPS at 4K, a resolution at which all compared diffusion-based baselines exceed the memory limit, while attaining strong no-reference perceptual quality at lower inference cost.
๐Ÿ“ Abstract
Real-time video restoration (VR) for live streams requires high-resolution outputs under strict per-frame latency constraints. Existing one-step diffusion-based VR models remain difficult to deploy on consumer-grade GPUs due to two main bottlenecks: quadratic spatial attention at high resolutions and the latency-memory overhead of large video autoencoders. We present SwiftVR, a streaming one-step generative VR framework that reduces both bottlenecks under a causal chunk-wise protocol. For attention, mask-free shifted-window self-attention gathers each spatial window into a dense tensor via deterministic indexing, keeping all attention calls on the dense scaled dot-product attention (SDPA) path without masks, cyclic shifts, padding, or hardware-specific sparse kernels. Because SwiftVR uses only standard dense SDPA calls, the trained model transfers to consumer GPUs without retraining or custom kernels. For autoencoding, a lightweight Restoration-aware Autoencoder enables fast chunk-wise decoding while preserving reconstruction quality. On a single H100, SwiftVR sustains 31 FPS at 2560×1440 and 14 FPS at 3840×2160, whereas all compared diffusion-based VR baselines exceed the memory limit at 4K. On a consumer RTX 5090, SwiftVR reaches 26 FPS at 1920×1080. To our knowledge, SwiftVR is the first generative VR model to achieve real-time 1080p streaming on a consumer-grade GPU, while attaining strong no-reference perceptual quality with lower inference cost. The project page is available at https://h-oliday.github.io/SwiftVR.
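The abstract's idea of gathering each spatial window through deterministic indexing can be illustrated with a small sketch. This is not the authors' implementation: the function names `window_spans` and `gather_windows` are hypothetical, and grouping windows by shape so that each group forms a dense batch is an assumption about one way to avoid masks, cyclic shifts, and padding. The attention call itself is omitted; only the index computation is shown.

```python
from itertools import product


def window_spans(length, window, shift):
    """Split range(length) into contiguous spans of a shifted window grid.

    With shift == 0 the cut points are window, 2*window, ...
    With shift > 0 the grid is offset, so the first span is shorter.
    Boundary spans are truncated: no padding and no wrap-around.
    """
    first_cut = shift % window or window
    edges = [0] + list(range(first_cut, length, window)) + [length]
    return list(zip(edges, edges[1:]))


def gather_windows(height, width, window, shift):
    """Return {(win_h, win_w): [flat token indices per window]}.

    Assumption (not from the paper): windows are grouped by shape, so each
    group could be stacked into one dense batch for a standard dense
    attention call with no mask.
    """
    groups = {}
    rows = window_spans(height, window, shift)
    cols = window_spans(width, window, shift)
    for (r0, r1), (c0, c1) in product(rows, cols):
        idx = [r * width + c for r in range(r0, r1) for c in range(c0, c1)]
        groups.setdefault((r1 - r0, c1 - c0), []).append(idx)
    return groups


if __name__ == "__main__":
    # A 6x8 token grid, window size 4, shifted by 2 tokens.
    for shape, windows in sorted(gather_windows(6, 8, 4, 2).items()):
        print(shape, len(windows), "window(s)")
```

Because the indices depend only on the grid size, window size, and shift, they can be computed once and reused for every frame. Every token lands in exactly one window, so no mask is needed to exclude padded or wrapped positions.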
Problem

Research questions and friction points this paper is trying to address.

video restoration
real-time
diffusion models
consumer GPU
high-resolution
Innovation

Methods, ideas, or system contributions that make the work stand out.

one-step diffusion
real-time video restoration
mask-free shifted-window attention
restoration-aware autoencoder
consumer GPU deployment
Jiaqi Yan
State Key Laboratory of Internet of Things for Smart City, Department of Computer and Information Science, University of Macau; Institute of Artificial Intelligence (TeleAI), China Telecom
Xiangyu Chen
Institute of Artificial Intelligence, China Telecom (TeleAI)
Low-Level Vision, Multimodal Understanding, Multimodal Generation
Xinlin Zhong
Institute of Artificial Intelligence (TeleAI), China Telecom; State Key Laboratory for Novel Software Technology, Nanjing University
Haibin Huang
Principal Research Scientist at TeleAI
Computer Graphics, Computer Vision, Geometric Modeling, 3D Deep Learning
Chi Zhang
Institute of Artificial Intelligence (TeleAI), China Telecom
Jie Liu
Nanjing University
Jiantao Zhou
Professor, Department of Computer and Information Science, University of Macau
Information Forensics and Security, Multimedia Signal Processing, Machine Learning
Xuelong Li
Institute of Artificial Intelligence (TeleAI), China Telecom