🤖 AI Summary
This work addresses the performance degradation and numerical instability that arise in the Shampoo optimizer when it reuses stale preconditioners, a common workaround for the high computational cost of matrix inversion. The paper provides a theoretical analysis of how delayed preconditioner updates affect convergence and stability, and identifies damping as a numerical stabilizer that suppresses the error induced by such delays. Building on this insight, the authors propose FOAM, an adaptive algorithm that dynamically adjusts both the damping factor and the frequency of eigendecompositions, guided by an approximation of the staleness-oriented error. In the reported experiments, FOAM reduces wall-clock time relative to standard Shampoo while maintaining robust convergence, balancing computational efficiency against numerical stability.
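For orientation, a common textbook formulation of Shampoo for a matrix parameter can be written with an explicit staleness index. The notation here is an assumption chosen for illustration and is not taken from the paper: $G_t$ is the gradient, $\eta$ the learning rate, $\epsilon$ the damping factor, and $\tau(t) \le t$ the most recent step at which the inverse roots were recomputed.

```latex
\begin{align}
  L_t &= L_{t-1} + G_t G_t^{\top}, \qquad
  R_t = R_{t-1} + G_t^{\top} G_t, \\
  W_{t+1} &= W_t - \eta \,
    \bigl(L_{\tau(t)} + \epsilon I\bigr)^{-1/4} \, G_t \,
    \bigl(R_{\tau(t)} + \epsilon I\bigr)^{-1/4}.
\end{align}
```

With $\tau(t) = t$ the preconditioner is always fresh. A larger gap $t - \tau(t)$ saves eigendecompositions but lets $L_{\tau(t)}$ and $R_{\tau(t)}$ drift away from the current statistics, while a larger $\epsilon$ pulls both inverse roots toward a multiple of the identity.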
📝 Abstract
Shampoo is attracting considerable attention for its superior performance on large-scale optimization benchmarks, yet it faces a significant practical bottleneck: the prohibitive computational overhead of matrix inversion. To mitigate this, practitioners typically reuse stale preconditioners and refresh them only infrequently, creating a fundamental trade-off between computational efficiency and optimization fidelity. In this work, we provide a theoretical study of staleness through the complementary lenses of convergence and stability. While tolerating staleness improves computational efficiency, it inherently degrades performance and introduces numerical instability. Crucially, we identify that damping, acting as a numerical stabilizer, can effectively suppress these negative effects. Guided by this analysis, we propose FOAM, an adaptive algorithm that stabilizes training by dynamically controlling both the damping factor and the eigendecomposition frequency based on an approximation of the staleness-oriented error. Experimental results demonstrate that FOAM reduces wall-clock time compared to standard Shampoo while maintaining robust convergence.
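The interaction between staleness and damping described above can be illustrated with a small NumPy sketch. It is not the FOAM algorithm and not the authors' code: the class `StaleShampoo`, the helpers `inverse_root`, `staleness_error`, and `run`, the fixed `refresh_every` schedule, and the use of random matrices in place of real gradients are all assumptions made for illustration. The sketch caches the inverse fourth roots of the Shampoo statistics, refreshes them only every few steps, and measures how far the cached left preconditioner has drifted from a freshly computed one.

```python
import numpy as np


def inverse_root(stat, damping, root=4):
    """Return (stat + damping * I)^(-1/root) for a symmetric PSD matrix."""
    eigvals, eigvecs = np.linalg.eigh(stat)
    # Clip tiny negative eigenvalues caused by round-off, then add damping.
    damped = np.maximum(eigvals, 0.0) + damping
    return (eigvecs * damped ** (-1.0 / root)) @ eigvecs.T


class StaleShampoo:
    """Illustrative Shampoo step for one matrix parameter (hypothetical class)."""

    def __init__(self, shape, lr=0.1, damping=1e-4, refresh_every=10):
        m, n = shape
        self.lr, self.damping, self.refresh_every = lr, damping, refresh_every
        self.L = np.zeros((m, m))  # left statistics: running sum of G G^T
        self.R = np.zeros((n, n))  # right statistics: running sum of G^T G
        self.PL = np.eye(m)        # cached (possibly stale) left inverse root
        self.PR = np.eye(n)        # cached (possibly stale) right inverse root
        self.step_count = 0
        self.num_refreshes = 0

    def step(self, W, G):
        # Statistics are cheap to accumulate, so they are updated every step.
        self.L += G @ G.T
        self.R += G.T @ G
        # The eigendecompositions are the expensive part, so they run rarely.
        if self.step_count % self.refresh_every == 0:
            self.PL = inverse_root(self.L, self.damping)
            self.PR = inverse_root(self.R, self.damping)
            self.num_refreshes += 1
        self.step_count += 1
        return W - self.lr * self.PL @ G @ self.PR


def staleness_error(opt):
    """Relative Frobenius gap between the cached and a fresh left inverse root."""
    fresh = inverse_root(opt.L, opt.damping)
    return np.linalg.norm(opt.PL - fresh) / np.linalg.norm(fresh)


def run(damping, refresh_every, steps=5, shape=(4, 3), seed=0):
    rng = np.random.default_rng(seed)
    opt = StaleShampoo(shape, damping=damping, refresh_every=refresh_every)
    W = np.zeros(shape)
    for _ in range(steps):
        G = rng.standard_normal(shape)  # random stand-in for a real gradient
        W = opt.step(W, G)
    return opt


# Same gradients, preconditioner computed once at step 0 and then left stale.
weak = run(damping=1e-8, refresh_every=100)
strong = run(damping=1e2, refresh_every=100)
print("staleness error, weak damping:  ", staleness_error(weak))
print("staleness error, strong damping:", staleness_error(strong))
```

In this toy setting the first left statistic is rank-deficient, so with weak damping the cached inverse root is very large along a direction that later gradients fill in, and the stale preconditioner ends up far from the fresh one. Strong damping keeps both close to a scaled identity, which shrinks the gap at the cost of weaker preconditioning. An adaptive scheme in the spirit of the abstract would adjust `damping` and `refresh_every` from an error estimate of this kind instead of fixing them in advance.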