🤖 AI Summary
To address high gradient variance, slow convergence, and reliance on normalization layers (e.g., AdaLN) in Diffusion Transformers (DiTs), this work proposes a magnitude-preserving network design and Rotation Modulation—a novel conditional modulation mechanism. The magnitude-preserving design replaces conventional normalization layers by constraining activation magnitudes, thereby enhancing training stability. Rotation Modulation parameterizes conditional transformations on the SO(2) group, replacing AdaLN's scale-and-shift operations with lightweight, learnable 2D rotations. This is the first introduction of magnitude preservation into DiT architectures and the first rotation-based conditional modulation paradigm for diffusion models. Experiments demonstrate a 12.8% reduction in FID; combining rotation modulation with scaling matches AdaLN's performance while reducing parameter count by 5.4%. The implementation is open-sourced.
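The core idea of Rotation Modulation can be illustrated with a minimal sketch: group feature channels into 2D pairs and rotate each pair by a condition-dependent angle. Since 2D rotations are orthogonal, the magnitude of each pair is preserved exactly, unlike AdaLN's scale-and-shift. The function below is a hypothetical illustration (the paper's actual parameterization, layer placement, and angle-prediction network may differ); in practice the angles would come from a learned linear map of the condition embedding, but here they are fixed constants.

```python
import math

def rotate_pairs(x, angles):
    """Apply a 2D rotation to consecutive channel pairs of x.

    x: flat list of floats with even length.
    angles: one rotation angle (radians) per channel pair.
    Rotations are orthogonal, so the norm of each pair (and of x
    as a whole) is unchanged -- the magnitude-preserving property.
    """
    out = []
    for i, theta in enumerate(angles):
        a, b = x[2 * i], x[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        out.extend([c * a - s * b, s * a + c * b])
    return out

# Hypothetical conditioning: fixed angles stand in for the output of a
# learned map from the condition embedding.
x = [1.0, 0.0, 0.5, 0.5]
y = rotate_pairs(x, [math.pi / 2, math.pi / 4])
norm = lambda v: math.sqrt(sum(t * t for t in v))
print(round(norm(x), 6) == round(norm(y), 6))  # True: magnitude preserved
```

Because the transformation never rescales activations, it composes naturally with the magnitude-preserving network design, whereas AdaLN's per-channel scale can grow or shrink activation magnitudes during training.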
📝 Abstract
Denoising diffusion models exhibit remarkable generative capabilities, but remain challenging to train due to their inherent stochasticity, where high-variance gradient estimates lead to slow convergence. Previous works have shown that magnitude preservation helps stabilize training in the U-Net architecture. This work explores whether this effect extends to the Diffusion Transformer (DiT) architecture. As such, we propose a magnitude-preserving design that stabilizes training without normalization layers. Motivated by the goal of maintaining activation magnitudes, we additionally introduce rotation modulation, a novel conditioning method that uses learned rotations instead of traditional scaling or shifting. Through empirical evaluations and ablation studies on small-scale models, we show that magnitude-preserving strategies significantly improve performance, notably reducing FID scores by $\sim$12.8%. Further, we show that rotation modulation combined with scaling is competitive with AdaLN, while requiring $\sim$5.4% fewer parameters. This work provides insights into conditioning strategies and magnitude control. We will publicly release the implementation of our method.