AI Summary
This work addresses a fundamental conflict in standard AdamW, where weight decay and adaptive gradient scaling engage in a "radial tug-of-war," disrupting the learning of parameter directions and introducing noise. To resolve this, the authors propose AdamO, which explicitly decouples the radial (norm) and tangential (direction) dynamics of parameter updates. AdamO operates in orthogonal subspaces: it applies SGD-style updates to the radial component with curvature-adaptive step sizes, while employing Adam-style adaptive preconditioning for the directional component. Furthermore, it incorporates an architecture-aware update rule tailored for scale-invariant layers. Experiments demonstrate that AdamO consistently outperforms AdamW across vision and language tasks, achieving superior generalization and training stability without requiring additional constraints or hyperparameter tuning.
Abstract
Is the standard weight decay in AdamW truly optimal? Although AdamW decouples weight decay from adaptive gradient scaling, a fundamental conflict remains: the Radial Tug-of-War. In deep learning, gradients tend to increase parameter norms to expand effective capacity while steering directions to learn features, whereas weight decay indiscriminately suppresses norm growth. This push–pull interaction induces radial oscillations, injecting noise into Adam's second-moment estimates and potentially degrading delicate tangential feature learning. We argue that magnitude and direction play distinct roles and should be decoupled in optimizer dynamics. We propose Orthogonal Dynamics Decoupling and instantiate it as AdamO: an SGD-style update handles the one-dimensional norm control, while Adam's adaptive preconditioning is confined to the tangential subspace. AdamO further incorporates curvature-adaptive radial step sizing and architecture-aware rules and projections for scale-invariant layers and low-dimensional parameters. Experiments on vision and language tasks show that AdamO improves generalization and stability over AdamW without introducing additional complex constraints.
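The core idea above, splitting each gradient into a radial component (along the current parameter vector, controlling the norm) and a tangential component (orthogonal to it, controlling the direction), can be sketched as follows. This is a minimal illustrative sketch of the decoupling principle only, not the paper's AdamO algorithm: the function names, hyperparameters, and the omission of bias correction and the curvature-adaptive radial step size are all simplifications.

```python
import numpy as np

def orthogonal_decompose(w, g, eps=1e-12):
    """Split gradient g into its radial (along w) and tangential
    (orthogonal to w) components."""
    w_hat = w / (np.linalg.norm(w) + eps)
    g_rad = np.dot(g, w_hat) * w_hat   # component along the parameter direction
    g_tan = g - g_rad                  # component orthogonal to it
    return g_rad, g_tan

def toy_decoupled_step(w, g, m, v, lr=1e-3, radial_lr=1e-3,
                      beta1=0.9, beta2=0.999, eps=1e-8):
    """One illustrative update: plain SGD on the radial component and
    Adam-style adaptive preconditioning on the tangential component.
    A sketch of the decoupling idea only (no bias correction, no
    curvature-adaptive radial step), not the paper's AdamO."""
    g_rad, g_tan = orthogonal_decompose(w, g)
    # Feed the Adam moments only the tangential gradient, so radial
    # oscillations cannot pollute the second-moment estimate.
    m = beta1 * m + (1 - beta1) * g_tan
    v = beta2 * v + (1 - beta2) * g_tan**2
    w = w - lr * m / (np.sqrt(v) + eps)  # tangential: adaptive update
    w = w - radial_lr * g_rad            # radial: SGD-style norm control
    return w, m, v
```

The decomposition guarantees that `g_rad + g_tan` reconstructs the original gradient and that `g_tan` is exactly orthogonal to `w`, which is what confines the adaptive statistics to the tangential subspace in this sketch.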