🤖 AI Summary
To address high gradient variance, slow convergence, and reliance on normalization layers (e.g., AdaLN) in Diffusion Transformers (DiTs), this work proposes a magnitude-preserving network design and Rotation Modulation—a novel conditional modulation mechanism. The magnitude-preserving design replaces conventional normalization layers by constraining activation magnitudes, thereby enhancing training stability. Rotation Modulation parameterizes conditional transformations on the SO(2) group, replacing AdaLN's scale-and-shift operations with lightweight, learnable 2D rotations. This is the first introduction of magnitude preservation into DiT architectures and the first rotation-based conditional modulation paradigm for diffusion models. Experiments demonstrate a 12.8% reduction in FID; combining rotation modulation with scaling matches AdaLN's performance while reducing parameter count by 5.4%. The implementation is open-sourced.
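The core idea of Rotation Modulation can be illustrated with a minimal sketch: group feature channels into 2D pairs and rotate each pair by a condition-dependent angle. Since 2D rotations are orthogonal, the magnitude of each pair is preserved exactly, unlike AdaLN's scale-and-shift. The function below is a hypothetical illustration (the paper's actual parameterization, layer placement, and angle-prediction network may differ); in practice the angles would come from a learned linear map of the condition embedding, but here they are fixed constants.

```python
import math

def rotate_pairs(x, angles):
    """Apply a 2D rotation to consecutive channel pairs of x.

    x: flat list of floats with even length.
    angles: one rotation angle (radians) per channel pair.
    Rotations are orthogonal, so the norm of each pair (and of x
    as a whole) is unchanged -- the magnitude-preserving property.
    """
    out = []
    for i, theta in enumerate(angles):
        a, b = x[2 * i], x[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        out.extend([c * a - s * b, s * a + c * b])
    return out

# Hypothetical conditioning: fixed angles stand in for the output of a
# learned map from the condition embedding.
x = [1.0, 0.0, 0.5, 0.5]
y = rotate_pairs(x, [math.pi / 2, math.pi / 4])
norm = lambda v: math.sqrt(sum(t * t for t in v))
print(round(norm(x), 6) == round(norm(y), 6))  # True: magnitude preserved
```

Because the transformation never rescales activations, it composes naturally with the magnitude-preserving network design, whereas AdaLN's per-channel scale can grow or shrink activation magnitudes during training.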
📝 Abstract
Denoising diffusion models exhibit remarkable generative capabilities, but remain challenging to train due to their inherent stochasticity, where high-variance gradient estimates lead to slow convergence. Previous works have shown that magnitude preservation helps stabilize training in the U-Net architecture. This work explores whether this effect extends to the Diffusion Transformer (DiT) architecture. As such, we propose a magnitude-preserving design that stabilizes training without normalization layers. Motivated by the goal of maintaining activation magnitudes, we additionally introduce rotation modulation, a novel conditioning method that uses learned rotations instead of traditional scaling or shifting. Through empirical evaluations and ablation studies on small-scale models, we show that magnitude-preserving strategies significantly improve performance, notably reducing FID scores by $\sim$12.8%. Further, we show that rotation modulation combined with scaling is competitive with AdaLN, while requiring $\sim$5.4% fewer parameters. This work provides insights into conditioning strategies and magnitude control. We will publicly release the implementation of our method.