🤖 AI Summary
In deep learning optimization, first-order methods (e.g., Adam) often suffer from poor generalization, while second-order methods (e.g., K-FAC) achieve superior generalization at prohibitive computational and memory cost. To address this trade-off, we propose NYSACT, a scalable gradient preconditioning framework featuring an eigenvalue-shifted Nyström approximation for modeling activation covariance matrices. This approach retains the generalization benefits of second-order optimization without explicit Hessian computation or large-scale matrix inversion, achieving near-linear time and space complexity: it avoids costly dense matrix operations while preserving the curvature information essential for effective preconditioning. Empirically, NYSACT surpasses Adam and K-FAC in test accuracy across multiple benchmark tasks, including image classification and language modeling, while reducing memory and computational overhead by an order of magnitude compared to standard second-order optimizers. The result is a principled, efficient, and scalable optimization strategy that reconciles strong generalization with practical tractability.
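The core ingredient, an eigenvalue-shifted Nyström approximation of an activation covariance matrix, can be sketched in a few lines of NumPy. This is an illustrative assumption of one common variant (shifting the sampled core block by `rho * I` before solving); the paper's exact sampling scheme and shift rule are not reproduced here:

```python
import numpy as np

def nystrom_shifted(A, idx, rho=1e-6):
    # Rank-|idx| Nystrom approximation of a PSD matrix A, with a small
    # eigenvalue shift rho on the core block so the solve stays stable.
    # Illustrative sketch only; not necessarily the paper's exact rule.
    C = A[:, idx]                               # sampled columns, shape (n, m)
    W = A[np.ix_(idx, idx)]                     # core block, shape (m, m)
    W_shifted = W + rho * np.eye(len(idx))      # eigenvalue shift
    return C @ np.linalg.solve(W_shifted, C.T)  # A_hat = C (W + rho I)^{-1} C^T

# Toy "activation covariance": approximately rank 8, so a rank-16
# Nystrom sketch should capture nearly all of it.
rng = np.random.default_rng(0)
X = rng.standard_normal((512, 8)) @ rng.standard_normal((8, 32))
A = X.T @ X / X.shape[0]                        # 32 x 32 PSD covariance
idx = rng.choice(A.shape[0], size=16, replace=False)
A_hat = nystrom_shifted(A, idx)
rel_err = np.linalg.norm(A - A_hat) / np.linalg.norm(A)
```

Because only `m` columns of the covariance are touched, building the sketch costs O(n·m) memory rather than O(n²), which is where the claimed near-linear complexity comes from.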
📝 Abstract
Adaptive gradient methods are computationally efficient and converge quickly, but they often suffer from poor generalization. In contrast, second-order methods enhance convergence and generalization but typically incur high computational and memory costs. In this work, we introduce NYSACT, a scalable first-order gradient preconditioning method that strikes a balance between state-of-the-art first-order and second-order optimization methods. NYSACT leverages an eigenvalue-shifted Nyström method to approximate the activation covariance matrix, which serves as the preconditioning matrix, significantly reducing time and memory complexity with minimal impact on test accuracy. Our experiments show that NYSACT not only achieves higher test accuracy than both first-order and second-order methods but also requires considerably fewer computational resources than existing second-order methods.
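To see why such a preconditioner is cheap to apply, note that a low-rank-plus-damping matrix can be inverted against a gradient with the Woodbury identity, so the dense n×n covariance is never formed or inverted. The sketch below is a hypothetical illustration under that assumption (the damping `lam` and shift `rho` are placeholder hyperparameters, and the function name is ours, not the paper's):

```python
import numpy as np

def precondition(C, W, g, lam=1e-3, rho=1e-6):
    # Apply (A_hat + lam I)^{-1} g, where A_hat = C (W + rho I)^{-1} C^T
    # is a shifted Nystrom approximation of the activation covariance.
    # Woodbury keeps the cost at O(n m^2); no n x n matrix is built.
    m = W.shape[0]
    L = np.linalg.cholesky(W + rho * np.eye(m))
    B = np.linalg.solve(L, C.T).T           # B @ B.T == A_hat, shape (n, m)
    U, s, _ = np.linalg.svd(B, full_matrices=False)
    e = s ** 2                              # eigenvalues of A_hat
    # (U diag(e) U^T + lam I)^{-1} = (I - U diag(e/(e+lam)) U^T) / lam
    return (g - U @ ((e / (e + lam)) * (U.T @ g))) / lam

rng = np.random.default_rng(1)
n, m = 64, 16
X = rng.standard_normal((512, n))
A = X.T @ X / X.shape[0]                    # dense covariance (reference only)
idx = rng.choice(n, size=m, replace=False)
C, W = A[:, idx], A[np.ix_(idx, idx)]
g = rng.standard_normal(n)
pg = precondition(C, W, g)                  # preconditioned gradient
```

In this sketch the per-step cost is dominated by the m×m Cholesky factorization and the thin SVD of an n×m matrix, which is the kind of saving that lets a Nyström preconditioner undercut the dense inversions required by standard second-order optimizers.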