SANA-Streaming: Real-time Streaming Video Editing with Hybrid Diffusion Transformer

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work targets real-time streaming video-to-video editing, where high temporal consistency and high inference throughput must be achieved together. The authors propose an algorithm-system co-designed framework with three parts: a hybrid Diffusion Transformer that uses softmax attention in a subset of its blocks; Cycle-Reverse Regularization, a training strategy that predicts source frames from generated content via flow matching; and fused GDN kernels combined with mixed-precision quantization tuned for the NVIDIA Blackwell (RTX 5090) architecture. The resulting system edits 1280×704 video at 24 end-to-end frames per second (FPS) on a single RTX 5090 GPU, with the DiT core running at 58 FPS, and the authors report better temporal coherence and throughput than existing approaches.
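To put the two reported frame rates side by side, the short sketch below converts them into per-frame time budgets. It assumes, purely for illustration, that the pipeline stages run strictly one after another so that their times add up; the source does not state how the stages are scheduled, and the name `frame_budget_ms` is hypothetical.

```python
def frame_budget_ms(fps: float) -> float:
    """Return the time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


# Frame rates reported in the summary above.
end_to_end_ms = frame_budget_ms(24.0)  # whole pipeline, about 41.7 ms per frame
dit_ms = frame_budget_ms(58.0)         # DiT core only, about 17.2 ms per frame

# Assumption: stages are sequential, so whatever is not spent in the DiT core
# is left for everything else in the pipeline.
other_ms = end_to_end_ms - dit_ms

print(f"end-to-end: {end_to_end_ms:.1f} ms, DiT: {dit_ms:.1f} ms, rest: {other_ms:.1f} ms")
```

Under that sequential assumption, the DiT core would account for well under half of the per-frame budget. If the stages overlap, the split would look different.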

📝 Abstract

Real-time streaming video-to-video (V2V) editing is critical for interactive applications such as live broadcasting and gaming, yet it remains a formidable challenge due to the stringent requirements for temporal consistency and inference throughput. In this paper, we present SANA-Streaming, a system-algorithm co-designed framework for high-resolution, real-time streaming video editing on consumer GPUs, with the following three core designs: (1) A Hybrid Diffusion Transformer architecture introduces softmax attention in part of the blocks to improve local modeling capabilities while preserving the efficiency of linear layers. (2) Cycle-Reverse Regularization is a novel training strategy that enforces semantic consistency by predicting source frames from generated content via flow matching, improving temporal consistency without requiring paired long edited videos. (3) Efficient System Co-design combines fused GDN kernels and Mixed-Precision Quantization (MPQ) optimized for the NVIDIA Blackwell (RTX 5090) architecture. By profiling real-world throughput, our MPQ maximizes Tensor Core utilization while maintaining generation quality. The resulting system achieves real-time 1280×704 resolution editing at 24 end-to-end FPS on a single RTX 5090 GPU, with the DiT core running at 58 FPS. Experimental results demonstrate that our co-design approach significantly outperforms existing SOTA methods in both temporal coherence and system throughput.
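The hybrid idea in design (1), softmax attention in only some blocks and a linear-cost mechanism elsewhere, can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration: the ReLU feature map, the "every fourth block" schedule, the missing projections, and the names `linear_attention` and `hybrid_stack`. The paper's actual linear blocks may differ (the abstract mentions GDN kernels), so this is not the authors' architecture.

```python
import numpy as np


def softmax_attention(q, k, v):
    """Standard scaled dot-product attention; cost grows quadratically with token count."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def linear_attention(q, k, v, eps=1e-6):
    """Kernelized attention; cost grows linearly with token count.

    Assumption: a ReLU feature map, chosen only for this illustration.
    """
    q, k = np.maximum(q, 0.0), np.maximum(k, 0.0)
    kv = k.T @ v       # (d, d_v) summary of all keys and values
    z = k.sum(axis=0)  # (d,) normalizer term
    return (q @ kv) / (q @ z + eps)[:, None]


def hybrid_stack(x, num_blocks=8, softmax_every=4):
    """Apply a stack of residual attention blocks with a hypothetical hybrid schedule."""
    for i in range(num_blocks):
        use_softmax = (i + 1) % softmax_every == 0
        attn = softmax_attention if use_softmax else linear_attention
        # Simplification: no learned projections, MLP, or normalization layers.
        x = x + attn(x, x, x)
    return x


rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))  # 16 tokens with 8 channels each
print(hybrid_stack(tokens).shape)
```

The linear variant never forms the token-by-token score matrix, which is where its efficiency comes from. The occasional softmax block restores full pairwise interactions at a higher cost.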

Problem

Research questions and friction points this paper is trying to address.

real-time streaming

video-to-video editing

temporal consistency

inference throughput

high-resolution

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid Diffusion Transformer

Cycle-Reverse Regularization

Mixed-Precision Quantization