LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration

📅 2026-02-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Diffusion models incur substantial computational costs in image and video generation, and existing acceleration methods often fail to adapt to the dynamic characteristics of different denoising stages, leading to degraded output quality. To address this, this work proposes LESA—a learnable, stage-aware predictor framework that introduces, for the first time, a stage-aware mechanism leveraging Kolmogorov–Arnold Networks (KANs), a multi-stage mixture-of-experts architecture, and a two-phase training strategy. LESA assigns dedicated predictors to distinct noise levels, enabling precise modeling of the diffusion process dynamics. Experiments demonstrate that LESA achieves a 5.0× speedup on FLUX.1-dev with only 1.0% quality loss, a 6.25× acceleration on Qwen-Image while surpassing the state of the art by 20.2%, and a 5.0× speedup on HunyuanVideo with a 24.7% improvement in PSNR.
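The core idea — dedicating a predictor to each noise-level stage and forecasting cached features instead of recomputing them — can be sketched as follows. This is an illustrative toy, not the paper's implementation: `StagePredictor`, `stage_index`, and the first-order linear extrapolator are hypothetical stand-ins for LESA's learned KAN experts.

```python
import numpy as np

class StagePredictor:
    """One expert per noise-level stage. As a placeholder for a trained
    predictor, this toy extrapolates linearly from the two most recent
    cached features."""
    def predict(self, cached):
        prev, curr = cached[-2], cached[-1]
        return curr + (curr - prev)  # first-order forecast

def stage_index(t, num_steps, num_stages):
    """Map a denoising timestep to its noise-level stage."""
    return min(t * num_stages // num_steps, num_stages - 1)

def forecast_feature(cache, t, num_steps, experts):
    """Route the feature cache to the stage's dedicated expert."""
    s = stage_index(t, num_steps, len(experts))
    return experts[s].predict(cache)

experts = [StagePredictor() for _ in range(4)]
cache = [np.array([1.0, 2.0]), np.array([1.5, 2.5])]
pred = forecast_feature(cache, t=10, num_steps=50, experts=experts)
assert np.allclose(pred, [2.0, 3.0])
```

On skipped timesteps, the sampler would consume `pred` in place of a full DiT forward pass, which is where the reported speedups come from.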

📝 Abstract
Diffusion models have achieved remarkable success in image and video generation tasks. However, the high computational demands of Diffusion Transformers (DiTs) pose a significant challenge to their practical deployment. While feature caching is a promising acceleration strategy, existing methods based on simple reuse or training-free forecasting struggle to adapt to the complex, stage-dependent dynamics of the diffusion process, often degrading quality and drifting from the standard denoising trajectory. To address this, we propose a LEarnable Stage-Aware (LESA) predictor framework based on two-stage training. Our approach leverages a Kolmogorov-Arnold Network (KAN) to accurately learn temporal feature mappings from data. We further introduce a multi-stage, multi-expert architecture that assigns specialized predictors to different noise-level stages, enabling more precise and robust feature forecasting. Extensive experiments show that our method achieves significant acceleration while maintaining high-fidelity generation: 5.00x acceleration on FLUX.1-dev with minimal quality degradation (a 1.0% drop), 6.25x speedup on Qwen-Image with a 20.2% quality improvement over the previous SOTA (TaylorSeer), and 5.00x acceleration on HunyuanVideo with a 24.7% PSNR improvement over TaylorSeer. State-of-the-art performance on both text-to-image and text-to-video synthesis validates the effectiveness and generalization of our training-based framework across models. Our code is included in the supplementary materials and will be released on GitHub.
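To make the KAN component concrete, here is a minimal forward pass of a KAN-style layer: each edge carries a learnable univariate function, and outputs are sums of those functions over the inputs. The radial-basis parameterisation, shapes, and initialisation below are assumptions for illustration; the paper's actual KAN predictor and its two-stage training are not shown.

```python
import numpy as np

class KANLayer:
    """Toy KAN-style layer: output_j = sum_i phi_ij(x_i), where each
    phi_ij is a learnable combination of fixed RBF basis functions
    (an assumed parameterisation, not necessarily the paper's)."""
    def __init__(self, in_dim, out_dim, num_basis=8, seed=0):
        rng = np.random.default_rng(seed)
        # Fixed RBF centres on [-1, 1]; learnable per-edge coefficients.
        self.centres = np.linspace(-1.0, 1.0, num_basis)
        self.coef = rng.normal(0.0, 0.1, (in_dim, out_dim, num_basis))

    def forward(self, x):
        # basis[i, b] = exp(-(x_i - c_b)^2), one row per input feature
        basis = np.exp(-(x[:, None] - self.centres[None, :]) ** 2)
        # Sum the per-edge functions over inputs i and basis terms b.
        return np.einsum('ib,ijb->j', basis, self.coef)

layer = KANLayer(in_dim=4, out_dim=2)
y = layer.forward(np.array([0.1, -0.3, 0.5, 0.0]))
assert y.shape == (2,)
```

In LESA's setting, such a layer would map cached features at earlier timesteps to the features expected at the current timestep, with one expert of this form trained per noise stage.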
Problem

Research questions and friction points this paper is trying to address.

Diffusion Models
Model Acceleration
Feature Caching
Stage-Dependent Dynamics
Computational Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

LESA
Stage-Aware Prediction
Diffusion Acceleration
Kolmogorov-Arnold Network
Multi-Expert Architecture