FIS-DiT: Breaking the Few-Step Video Inference Barrier via Training-Free Frame Interleaved Sparsity

📅 2026-05-12

🤖 AI Summary

Few-step video generation leaves too few denoising states for step-wise feature reuse or predictive modeling to pay off, so per-step inference latency becomes the critical bottleneck. This work proposes FIS-DiT, a training-free, operator-agnostic acceleration framework built on the observation that frame-level sparsity and structural consistency coexist along the latent frame dimension. Leveraging this insight, the authors introduce a Frame Interleaved Sparsity (FIS) execution strategy that operates on subsets of frames across the DiT hierarchy, so that all latent frame positions are refreshed without full-scale block computation while global spatiotemporal context is preserved. Evaluated on Wan 2.2 and HunyuanVideo 1.5, FIS-DiT achieves a 2.11–2.41× speedup with negligible degradation in VBench-Q and CLIP metrics, offering a scalable path toward real-time high-definition video generation.

📝 Abstract

While the overall inference latency of Video Diffusion Transformers (DiTs) can be substantially reduced through model distillation, per-step inference latency remains a critical bottleneck. Existing acceleration paradigms primarily exploit redundancy across the denoising trajectory; however, we identify a limitation where these step-wise strategies encounter diminishing returns in few-step regimes. In such scenarios, the scarcity of temporal states prevents effective feature reuse or predictive modeling, creating a formidable barrier to further acceleration. To overcome this, we propose Frame Interleaved Sparsity DiT (FIS-DiT), a training-free and operator-agnostic framework that shifts the optimization focus from the temporal trajectory to the latent frame dimension. Our approach is motivated by an intrinsic duality within this dimension: the existence of frame-wise sparsity that permits reduced computation, coupled with a structural consistency where each frame position remains equally vital to the global spatiotemporal context. Leveraging this insight, we implement Frame Interleaved Sparsity (FIS) as an execution strategy that manipulates frame subsets across the model hierarchy, refreshing all latent positions without requiring full-scale block computation. Empirical evaluations on Wan 2.2 and HunyuanVideo 1.5 demonstrate that FIS-DiT consistently achieves 2.11–2.41× speedup with negligible degradation across VBench-Q and CLIP metrics, providing a scalable and robust pathway toward real-time high-definition video generation.
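
To make the idea of interleaving frame subsets across the model hierarchy more concrete, here is a minimal scheduling sketch in plain Python. It is an illustration under assumptions, not the authors' algorithm: the abstract does not specify how subsets are chosen, so the round-robin rule, the `stride` parameter, and the helper names `interleaved_schedule`, `coverage`, and `compute_fraction` are all hypothetical. The sketch only shows how each block could process a fraction of the latent frame positions while every position is still refreshed somewhere in the stack; it does not model attention or how global context is preserved.

```python
def interleaved_schedule(num_frames, num_blocks, stride=2):
    """Hypothetical round-robin rule (an assumption, not from the paper):
    block b processes the frame positions f with f % stride == b % stride."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    return [
        [f for f in range(num_frames) if f % stride == b % stride]
        for b in range(num_blocks)
    ]


def coverage(schedule, num_frames):
    """Count how many blocks process each latent frame position."""
    counts = [0] * num_frames
    for frames in schedule:
        for f in frames:
            counts[f] += 1
    return counts


def compute_fraction(schedule, num_frames):
    """Fraction of (block, frame) pairs computed, relative to dense execution."""
    total = sum(len(frames) for frames in schedule)
    return total / (len(schedule) * num_frames)


# Toy sizes chosen only for illustration.
schedule = interleaved_schedule(num_frames=8, num_blocks=4, stride=2)
for block_index, frames in enumerate(schedule):
    print(f"block {block_index}: frames {frames}")
print("per-frame coverage:", coverage(schedule, 8))
print("compute fraction:", compute_fraction(schedule, 8))
```

With these toy sizes, even-numbered blocks handle the even frame positions and odd-numbered blocks handle the odd ones, so every position is visited twice across four blocks while each block touches half of the frames. The fraction of frame positions processed is not the same as wall-clock speedup, which also depends on attention cost and other overheads, so this number should not be read as an estimate of the reported 2.11–2.41× figure.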

Problem

Research questions and friction points this paper is trying to address.

few-step video inference

inference latency

temporal redundancy

video diffusion models

frame sparsity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Frame Interleaved Sparsity

Training-Free Acceleration

Video Diffusion Transformers