MAC: An Efficient Gradient Preconditioning using Mean Activation Approximated Curvature

📅 2025-06-10
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Second-order optimization methods (e.g., KFAC) suffer from high computational overhead and poor scalability in deep Transformer architectures. Method: the paper proposes MAC, the first approach to apply a Kronecker-factored approximation to the Fisher Information Matrix (FIM) of attention layers, explicitly modeling how attention scores influence curvature; it further introduces a mean-activation-based approximation to construct efficient layer-wise preconditioning matrices, sharply reducing memory and computational costs. Contribution/Results: the authors establish two sufficient conditions under which MAC converges to the global optimum in nonlinear networks. Empirically, MAC outperforms KFAC and state-of-the-art second-order baselines across diverse models and datasets, achieving higher final accuracy, shorter end-to-end training time, and lower GPU memory consumption.
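For reference, the Kronecker factorization the summary refers to is the standard KFAC approximation of the layer-wise FIM; the generic notation below is a reviewer's gloss, not the paper's own:

```latex
\[
  F_\ell \;\approx\; A_{\ell-1} \otimes G_\ell ,
  \qquad
  A_{\ell-1} = \mathbb{E}\!\left[ a_{\ell-1}\, a_{\ell-1}^{\top} \right],
  \quad
  G_\ell = \mathbb{E}\!\left[ g_\ell\, g_\ell^{\top} \right],
\]
where $a_{\ell-1}$ denotes the layer input (activation) and $g_\ell$ the gradient of the
loss with respect to the layer's pre-activation.
```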

πŸ“ Abstract
Second-order optimization methods for training neural networks, such as KFAC, exhibit superior convergence by utilizing curvature information of the loss landscape. However, this comes at the expense of a high computational burden. In this work, we analyze the two components that constitute the layer-wise Fisher information matrix (FIM) used in KFAC: the Kronecker factors related to activations and pre-activation gradients. Based on empirical observations of their eigenspectra, we propose efficient approximations for them, resulting in a computationally efficient optimization method called MAC. To the best of our knowledge, MAC is the first algorithm to apply the Kronecker factorization to the FIM of attention layers used in transformers and to explicitly integrate attention scores into the preconditioning. We also study the convergence properties of MAC on nonlinear neural networks and provide two conditions under which it converges to global minima. Our extensive evaluations on various network architectures and datasets show that the proposed method outperforms KFAC and other state-of-the-art methods in terms of accuracy, end-to-end training time, and memory usage. Code is available at https://github.com/hseung88/mac.
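To make the abstract concrete, here is a minimal, hedged sketch of the kind of layer-wise preconditioning it describes: the first function is the classical KFAC update for a single linear layer, and the second swaps the activation factor for a rank-one factor built from the batch-mean activation. The rank-one construction, function names, and damping value are illustrative assumptions, not MAC's actual formulation; the authors' implementation is in the GitHub repository linked above.

```python
# Illustrative sketch only -- not the authors' implementation (see the linked repo for MAC).
# KFAC preconditions a linear layer's gradient with the two Kronecker factors named in
# the abstract: an activation factor A and a pre-activation-gradient factor G.
# The "mean-activation" variant below replaces A with a rank-one factor built from the
# batch-mean activation; this rank-one form is an assumption made for illustration.

import torch


def kfac_precondition(grad_w, acts, grads_out, damping=1e-3):
    """Classical KFAC step for one linear layer: (G + lam*I)^-1 @ dW @ (A + lam*I)^-1."""
    n = acts.shape[0]
    A = acts.T @ acts / n                                # (d_in, d_in) activation factor
    G = grads_out.T @ grads_out / n                      # (d_out, d_out) gradient factor
    A_inv = torch.linalg.inv(A + damping * torch.eye(A.shape[0]))
    G_inv = torch.linalg.inv(G + damping * torch.eye(G.shape[0]))
    return G_inv @ grad_w @ A_inv


def mean_act_precondition(grad_w, acts, grads_out, damping=1e-3):
    """Hypothetical mean-activation variant: A ~= a_bar a_bar^T, inverted in closed
    form via Sherman-Morrison instead of a full O(d_in^3) dense inverse."""
    a_bar = acts.mean(dim=0, keepdim=True)               # (1, d_in) mean activation
    d_in = acts.shape[1]
    # (a_bar^T a_bar + lam*I)^-1 = (I - a_bar^T a_bar / (lam + ||a_bar||^2)) / lam
    A_inv = (torch.eye(d_in) - (a_bar.T @ a_bar) / (damping + a_bar.pow(2).sum())) / damping
    G = grads_out.T @ grads_out / grads_out.shape[0]
    G_inv = torch.linalg.inv(G + damping * torch.eye(G.shape[0]))
    return G_inv @ grad_w @ A_inv


if __name__ == "__main__":
    batch, d_in, d_out = 64, 128, 32
    acts = torch.randn(batch, d_in)                      # layer inputs a
    grads_out = torch.randn(batch, d_out)                # pre-activation gradients g
    grad_w = grads_out.T @ acts / batch                  # batch-averaged weight gradient
    print(kfac_precondition(grad_w, acts, grads_out).shape)    # torch.Size([32, 128])
    print(mean_act_precondition(grad_w, acts, grads_out).shape)
```

The practical appeal of a mean-activation factor, under this reading, is that its damped inverse is available in closed form, so no per-layer eigendecomposition or dense inverse of the activation factor is required.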
Problem

Research questions and friction points this paper is trying to address.

Reducing the computational burden of second-order neural-network optimization
Efficiently approximating the Kronecker factors of the Fisher information matrix
Improving convergence and performance on transformer attention layers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Efficient Kronecker factorization of the FIM, extended to attention layers
Integrates attention scores into the preconditioning
Reduces the computational burden and memory usage of second-order optimization
Hyunseok Seung
Department of Statistics, University of Georgia
Jaewoo Lee
School of Computing, University of Georgia
Hyunsuk Ko
Associate Professor, School of Electrical Engineering, Hanyang University ERICA
Video Coding · Deep Learning · Computer Vision · Image Quality Assessment