Exploring Magnitude Preservation and Rotation Modulation in Diffusion Transformers

📅 2025-05-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address high gradient variance, slow convergence, and reliance on normalization layers (e.g., AdaLN) in Diffusion Transformers (DiTs), this work proposes a magnitude-preserving network design and Rotation Modulation—a novel conditional modulation mechanism. The magnitude-preserving design replaces conventional normalization layers by constraining activation magnitudes, thereby enhancing training stability. Rotation Modulation parameterizes conditional transformations on the SO(2) group, substituting AdaLN’s scale-and-shift operations with lightweight, learnable 2D rotations. This is the first introduction of magnitude preservation into DiT architectures and the first rotation-based conditional modulation paradigm for diffusion models. Experiments demonstrate a 12.8% reduction in FID score; combining rotation modulation with scaling matches AdaLN’s performance while reducing parameter count by 5.4%. The implementation is open-sourced.
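The rotation modulation described above swaps AdaLN's scale-and-shift for learnable 2D rotations acting on pairs of channels; because rotations are orthogonal, they preserve activation magnitudes by construction. A minimal NumPy sketch of this core idea (the function name and tensor shapes are illustrative, not the paper's released code):

```python
import numpy as np

def rotation_modulation(x, theta):
    """Rotate consecutive channel pairs by conditioning-dependent angles.

    x:     (..., 2k) activations
    theta: (..., k) per-pair angles, e.g. predicted from a timestep/class
           embedding (the conditioning network is omitted here)

    Each pair (x[2i], x[2i+1]) is rotated by theta[i]; rotations are
    orthogonal, so the norm of x is unchanged.
    """
    pairs = x.reshape(*x.shape[:-1], -1, 2)            # (..., k, 2)
    c, s = np.cos(theta), np.sin(theta)
    y0 = c * pairs[..., 0] - s * pairs[..., 1]         # standard SO(2) rotation
    y1 = s * pairs[..., 0] + c * pairs[..., 1]
    return np.stack([y0, y1], axis=-1).reshape(*y0.shape[:-1], -1)
```

In a real DiT block the angles would come from a small conditioning MLP, analogous to how AdaLN predicts scale and shift; here they are just an input.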

📝 Abstract
Denoising diffusion models exhibit remarkable generative capabilities, but remain challenging to train due to their inherent stochasticity, where high-variance gradient estimates lead to slow convergence. Previous works have shown that magnitude preservation helps with stabilizing training in the U-Net architecture. This work explores whether this effect extends to the Diffusion Transformer (DiT) architecture. As such, we propose a magnitude-preserving design that stabilizes training without normalization layers. Motivated by the goal of maintaining activation magnitudes, we additionally introduce rotation modulation, which is a novel conditioning method using learned rotations instead of traditional scaling or shifting. Through empirical evaluations and ablation studies on small-scale models, we show that magnitude-preserving strategies significantly improve performance, notably reducing FID scores by $\sim$12.8%. Further, we show that rotation modulation combined with scaling is competitive with AdaLN, while requiring $\sim$5.4% fewer parameters. This work provides insights into conditioning strategies and magnitude control. We will publicly release the implementation of our method.
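The magnitude-preserving design mentioned in the abstract replaces normalization layers by constraining the layers themselves so that activation magnitudes stay controlled, in the spirit of prior magnitude-preserving U-Net work. One common ingredient is a linear layer whose weight rows are forced to unit norm; a hedged NumPy sketch (illustrative only, not the paper's implementation):

```python
import numpy as np

def mp_linear(x, w):
    """Magnitude-preserving linear layer (illustrative sketch).

    Each output unit's weight vector is rescaled to unit norm before the
    matmul. For inputs with i.i.d. unit-variance features, each output
    unit then also has ~unit variance, so activation magnitudes stay
    controlled without a separate normalization layer.
    """
    w_hat = w / np.linalg.norm(w, axis=1, keepdims=True)  # rows of (out, in) -> unit norm
    return x @ w_hat.T
```

In training, such a constraint would be applied to the learned weights at every forward pass; this sketch only shows the variance-preserving effect of the reparameterization.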
Problem

Research questions and friction points this paper is trying to address.

Stabilize training in Diffusion Transformers via magnitude preservation
Introduce rotation modulation as novel conditioning method
Improve performance and reduce parameters in diffusion models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Magnitude-preserving design stabilizes DiT training
Rotation modulation replaces traditional conditioning methods
Fewer parameters with competitive performance
Eric Tillman Bill (ETH Zürich, Zürich, Switzerland)
Cristian Perez Jensen (ETH Zürich, Zürich, Switzerland)
Sotiris Anagnostidis (Anthropic)
Dimitri von Rutte (ETH Zürich, Zürich, Switzerland)