🤖 AI Summary
Diffusion models suffer from slow inference due to iterative sampling, and existing acceleration methods often compromise fidelity. To address this, we propose a training-free inference-acceleration framework that leverages an empirically discovered property: the relative invariance of features across timesteps and network layers. Our method introduces a deterministic trajectory-extraction mechanism to construct binary cache matrices, enabling joint module-level and full-step-level caching. We further design a quantile-based change metric to dynamically identify cacheable regions and integrate resampling-based correction to preserve reconstruction accuracy. Evaluated on DiT and FLUX architectures, our approach achieves 2-3× end-to-end speedup with negligible degradation in generation quality and no perceptible visual artifacts. This significantly enhances the practical deployability of diffusion models without retraining or architectural modification.
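As a rough illustration of the quantile-based cache plan (a minimal sketch under assumed shapes and names such as `feature_change` and `q`, not the authors' released code), one could threshold recorded per-module feature changes as follows:

```python
import numpy as np

def build_cache_plan(feature_change: np.ndarray, q: float = 0.3) -> np.ndarray:
    """Build a binary cache-plan matrix from recorded feature changes.

    feature_change: array of shape [T, L, M] giving the relative change of each
    module's output between consecutive timesteps (T timesteps, L layers, M modules),
    collected from a few deterministic sampling runs.
    Returns a boolean matrix of the same shape: True means "reuse the cached output".
    """
    threshold = np.quantile(feature_change, q)  # quantile-based change metric
    plan = feature_change <= threshold          # small change => cacheable
    plan[0] = False                             # always compute the first timestep fully
    return plan
```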
📝 Abstract
Diffusion models deliver high-fidelity synthesis but remain slow due to iterative sampling. We empirically observe that feature invariance exists in deterministic sampling, and present InvarDiff, a training-free acceleration method that exploits relative temporal invariance at both the timestep and layer scales. From a few deterministic runs, we compute a per-timestep, per-layer, per-module binary cache-plan matrix using quantile-based change metrics; this matrix specifies which module at which step is reused rather than recomputed, and a resampling correction avoids drift when consecutive caches occur. The same invariance criterion is applied at the step scale to enable cross-timestep caching, deciding whether an entire step can reuse cached results. During inference, InvarDiff performs step-first and layer-wise caching guided by this matrix. Applied to DiT and FLUX, our approach reduces redundant computation while preserving fidelity. Experiments show that InvarDiff achieves $2$-$3\times$ end-to-end speed-ups with minimal impact on standard quality metrics. Qualitatively, we observe almost no degradation in visual quality compared with full computation.
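A minimal inference sketch under assumed interfaces (`model.embed`, `model.blocks`, `model.head`, `scheduler.step`, and the `step_plan`/`module_plan` booleans are illustrative placeholders, not the InvarDiff release) of step-first, module-level caching might look like:

```python
import torch

@torch.no_grad()
def sample(x, timesteps, model, scheduler, step_plan, module_plan):
    """Step-first, then module-level caching guided by precomputed boolean plans."""
    eps_cache, block_cache = None, {}
    for i, t in enumerate(timesteps):
        if step_plan[i] and eps_cache is not None:
            eps = eps_cache                      # step-level reuse: skip the whole network
        else:
            h = model.embed(x, t)                # assumed embedding of noisy latent + timestep
            for l, block in enumerate(model.blocks):
                if module_plan[i, l] and l in block_cache:
                    h = h + block_cache[l]       # module-level reuse of the cached residual
                else:
                    out = block(h)
                    block_cache[l] = out - h     # refresh this module's cached contribution
                    h = out
            eps = model.head(h)
            eps_cache = eps
        x = scheduler.step(eps, t, x)            # deterministic sampler update
    return x
```

Caching residual contributions (`out - h`) rather than raw activations is one common way to keep reuse robust to small input drift; the resampling correction described in the abstract would additionally guard against error accumulating over consecutive cached steps.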