Improved Training Technique for Shortcut Models

📅 2025-10-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Shortcut generative models suffer from five key bottlenecks: compounding guidance that induces image artifacts, fixed guidance that limits inference-time controllability, a low-frequency bias that distorts reconstructions, exponential moving average (EMA) training that conflicts with self-consistency, and curved flow trajectories that hinder convergence. This paper proposes iSM, a unified training framework that is the first to formally identify and characterize the flaws of compounding guidance, and introduces Intrinsic Guidance to enable dynamic, controllable sampling. To address frequency bias, it incorporates a Multi-Level Wavelet Loss; to straighten flow paths, it adopts Scaling Optimal Transport (sOT); and to reconcile training stability with self-consistency, it proposes a Twin EMA strategy. Evaluated on ImageNet 256×256, iSM substantially reduces FID across one-step, few-step, and multi-step generation, advancing shortcut models toward an efficient, stable, and flexible general-purpose generative paradigm.
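The summary notes a conflict between standard EMA training and the self-consistency objective, addressed with a Twin EMA strategy. The paper's exact update rules are not given here; the sketch below only illustrates the general two-average idea under assumed decay rates (all names and constants are hypothetical): a slow EMA provides stable weights, while a fast EMA stays close to the online network so a self-consistency target does not lag far behind it.

```python
import numpy as np

def ema_update(ema_params, params, decay):
    """One EMA step per parameter: ema <- decay * ema + (1 - decay) * params."""
    return {k: decay * v + (1.0 - decay) * params[k] for k, v in ema_params.items()}

# Hypothetical training loop: the online weights drift, and two EMAs with
# different decays track them. The fast EMA follows closely; the slow EMA lags.
params = {"w": np.ones(3)}
slow = {"w": params["w"].copy()}
fast = {"w": params["w"].copy()}
for step in range(100):
    params["w"] = params["w"] + 0.01      # stand-in for an optimizer update
    slow = ema_update(slow, params, decay=0.999)
    fast = ema_update(fast, params, decay=0.9)

lag_fast = abs(fast["w"][0] - params["w"][0])
lag_slow = abs(slow["w"][0] - params["w"][0])
```

With these decays the fast average ends much closer to the online weights than the slow one, which is the property a twin-average scheme exploits.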

📝 Abstract
Shortcut models represent a promising, non-adversarial paradigm for generative modeling, uniquely supporting one-step, few-step, and multi-step sampling from a single trained network. However, their widespread adoption has been stymied by critical performance bottlenecks. This paper tackles the five core issues that have held shortcut models back: (1) the hidden flaw of compounding guidance, which we are the first to formalize, causing severe image artifacts; (2) inflexible fixed guidance that restricts inference-time control; (3) a pervasive frequency bias driven by a reliance on low-level distances in the direct domain, which biases reconstructions toward low frequencies; (4) divergent self-consistency arising from a conflict with EMA training; and (5) curvy flow trajectories that impede convergence. To address these challenges, we introduce iSM, a unified training framework that systematically resolves each limitation. Our framework is built on four key improvements: Intrinsic Guidance provides explicit, dynamic control over guidance strength, resolving both compounding guidance and inflexibility. A Multi-Level Wavelet Loss mitigates frequency bias to restore high-frequency details. Scaling Optimal Transport (sOT) reduces training variance and learns straighter, more stable generative paths. Finally, a Twin EMA strategy reconciles training stability with self-consistency. Extensive experiments on ImageNet 256×256 demonstrate that our approach yields substantial FID improvements over baseline shortcut models across one-step, few-step, and multi-step generation, making shortcut models a viable and competitive class of generative models.
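The abstract credits Scaling Optimal Transport with learning straighter generative paths, but this page does not spell out the mechanism. A common way to straighten flow trajectories, used in OT-based flow matching, is to pair noise and data samples within a minibatch so that total transport cost is small instead of pairing them at random. The sketch below is a hypothetical illustration of that pairing idea using a simple greedy approximation to the exact assignment; it is not the paper's sOT procedure.

```python
import numpy as np

def greedy_pairing(noise, data):
    """Greedily match each noise sample to the cheapest unused data sample
    under squared Euclidean cost. A cheap stand-in for an exact OT assignment."""
    cost = ((noise[:, None, :] - data[None, :, :]) ** 2).sum(-1)
    used = set()
    pair = []
    for i in range(len(noise)):
        j = min((j for j in range(len(data)) if j not in used),
                key=lambda j: cost[i, j])
        used.add(j)
        pair.append(j)
    return np.array(pair)

# Random pairing would couple 0 with 10 and 10 with 0; OT-style pairing
# couples each noise point with the nearby data point instead.
noise = np.array([[0.0], [10.0]])
data = np.array([[10.0], [0.0]])
pair = greedy_pairing(noise, data)
```

Shorter transport distances between coupled endpoints mean straighter, lower-variance interpolation paths for the flow model to fit.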
Problem

Research questions and friction points this paper is trying to address.

Resolving compounding guidance artifacts in generative models
Addressing frequency bias to restore high-frequency image details
Improving training stability and self-consistency in shortcut models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Intrinsic Guidance enables dynamic control over guidance strength
Multi-Level Wavelet Loss mitigates frequency bias for details
Scaling Optimal Transport learns straighter generative paths
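The Multi-Level Wavelet Loss bullet above can be made concrete with a small sketch. The paper's exact loss is not specified on this page, so the following is an assumed Haar-based variant (all function names are hypothetical): decompose prediction and target with a 2-D Haar transform, penalize the L1 distance between detail bands at each level, and add the residual on the coarsest low-pass band. Matching detail bands explicitly is what counteracts the low-frequency bias of plain pixel-space distances.

```python
import numpy as np

def haar_dwt2(x):
    """One level of a 2-D Haar transform.
    Returns the low-pass band and the three detail bands (horizontal,
    vertical, diagonal), each at half resolution."""
    a = (x[0::2, 0::2] + x[0::2, 1::2] + x[1::2, 0::2] + x[1::2, 1::2]) / 2.0
    h = (x[0::2, 0::2] - x[0::2, 1::2] + x[1::2, 0::2] - x[1::2, 1::2]) / 2.0
    v = (x[0::2, 0::2] + x[0::2, 1::2] - x[1::2, 0::2] - x[1::2, 1::2]) / 2.0
    d = (x[0::2, 0::2] - x[0::2, 1::2] - x[1::2, 0::2] + x[1::2, 1::2]) / 2.0
    return a, (h, v, d)

def multilevel_wavelet_loss(pred, target, levels=3):
    """Sum of L1 distances between the wavelet detail bands of pred and
    target at every level, plus the coarsest low-pass residual."""
    loss = 0.0
    for _ in range(levels):
        pa, pdet = haar_dwt2(pred)
        ta, tdet = haar_dwt2(target)
        loss += sum(np.abs(p - t).mean() for p, t in zip(pdet, tdet))
        pred, target = pa, ta
    loss += np.abs(pred - target).mean()
    return loss
```

Note that a constant intensity shift lands entirely in the low-pass term, while high-frequency errors are charged at every level, so the loss weights fine detail more heavily than a plain pixel L1.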