Rethinking Bregman Divergences in Kronecker-Factored Optimizers

📅 2026-05-30

🤖 AI Summary
This work investigates how different Bregman divergences (Frobenius, von Neumann, and LogDet) allocate approximation error across the spectrum when the covariance matrix is not exactly Kronecker-factored. Spectral analysis of the covariance matrix shows that its top eigenspace aligns closely with the Hessian, while the tail subspace is dominated by noise. Building on this, the authors propose a subspace-aware Kronecker optimizer that applies eigenvalue-based preconditioning in the reliable top subspace and an adaptive isotropic acceleration constant in the noise-dominated bottom subspace. The approach separates signal from noise components, with the aim of more stable and efficient optimization when the Kronecker structure is inexact.
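To make the top/bottom split concrete, here is a minimal sketch of a subspace-aware preconditioning step. It is not the authors' algorithm: it works on a full covariance matrix and a vector gradient rather than on Kronecker factors, and the function name `subspace_aware_precondition`, the parameters `k` and `power`, and the rule for the isotropic constant (the mean tail eigenvalue raised to `-power`) are all assumptions chosen for illustration.

```python
import numpy as np

def subspace_aware_precondition(grad, cov, k, power=0.5, eps=1e-12):
    """Illustrative sketch only; the names and the tail rule are hypothetical.

    grad : (d,) gradient vector
    cov  : (d, d) symmetric PSD covariance estimate
    k    : size of the top eigenspace, with 1 <= k <= d
    """
    d = cov.shape[0]
    if not 1 <= k <= d:
        raise ValueError("k must satisfy 1 <= k <= d")

    # eigh returns eigenvalues in ascending order, so the top-k are the last k.
    w, V = np.linalg.eigh(cov)
    top_w, top_V = w[-k:], V[:, -k:]
    tail_w = w[:-k]

    # Top subspace: rescale each eigen-direction by eigenvalue^(-power).
    coeff = top_V.T @ grad
    top_part = top_V @ (coeff / (top_w + eps) ** power)

    # Bottom subspace: a single isotropic constant applied to the residual.
    # ASSUMPTION: the constant is derived from the mean tail eigenvalue.
    tail_part = grad - top_V @ coeff
    if tail_w.size:
        c = 1.0 / (max(tail_w.mean(), 0.0) + eps) ** power
    else:
        c = 0.0
    return top_part + c * tail_part
```

With `cov = np.diag([9.0, 4.0, 1.0, 1.0])`, `k=2`, and a gradient of ones, the two leading coordinates are scaled by the inverse square roots of their eigenvalues and the remaining coordinates share one common scale.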
📝 Abstract
Shampoo-style optimizers approximate gradient covariance matrices using Kronecker-factored structures. Recent work [lin2026understanding] showed that such approximations can be viewed as projections under Bregman matrix divergences, leading to different Kronecker-factored preconditioners. However, it remains unclear what role the choice of divergence plays when the covariance is not exactly Kronecker-factored. We study this question through the spectrum of the covariance matrix. We show that Frobenius, von Neumann, and LogDet divergences distribute the unavoidable Kronecker approximation error differently across the covariance spectrum. We further show that their Kronecker factors are governed by divergence-weighted residuals rather than the raw approximation error, explaining how these spectral preferences are realized in the resulting preconditioners. Empirically, we observe that the top covariance eigenspace is substantially better aligned with the Hessian matrix, while the tail spectrum is much noisier and less reliable. Motivated by these findings, we propose a subspace-aware Kronecker optimizer that applies eigenvalue-based preconditioning in the top subspace and uses an adaptive isotropic acceleration constant in the bottom subspace.
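For readers who want the three divergences in concrete form, the sketch below uses their standard textbook definitions for symmetric positive definite matrices; the paper's exact conventions (scaling, argument order) may differ. It also includes the classical rearrangement-plus-SVD construction of the Frobenius-nearest Kronecker product. That construction is a generic technique, not the paper's projection procedure, and all function names here are illustrative.

```python
import numpy as np

def _logm_spd(X):
    # Matrix logarithm of a symmetric positive definite matrix via eigh.
    w, V = np.linalg.eigh(X)
    return (V * np.log(w)) @ V.T

def frobenius_div(X, Y):
    # D_F(X, Y) = 0.5 * ||X - Y||_F^2
    return 0.5 * np.linalg.norm(X - Y, "fro") ** 2

def von_neumann_div(X, Y):
    # D_vN(X, Y) = tr(X log X - X log Y - X + Y)
    return np.trace(X @ _logm_spd(X) - X @ _logm_spd(Y) - X + Y)

def logdet_div(X, Y):
    # D_ld(X, Y) = tr(X Y^{-1}) - log det(X Y^{-1}) - d
    # Y^{-1} X has the same trace and determinant as X Y^{-1}.
    M = np.linalg.solve(Y, X)
    _, logabsdet = np.linalg.slogdet(M)
    return np.trace(M) - logabsdet - X.shape[0]

def nearest_kronecker_frobenius(C, m, n):
    """Best A (m x m), B (n x n) minimizing ||C - kron(A, B)||_F.

    Rearranges C so that kron(A, B) becomes the rank-one matrix
    vec(A) vec(B)^T, then keeps the leading singular triple.
    """
    # C[i*n + k, j*n + l] -> R[i*m + j, k*n + l]
    R = C.reshape(m, n, m, n).transpose(0, 2, 1, 3).reshape(m * m, n * n)
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    A = np.sqrt(s[0]) * U[:, 0].reshape(m, m)
    B = np.sqrt(s[0]) * Vt[0].reshape(n, n)
    # Resolve the joint sign ambiguity so that A has non-negative trace.
    if np.trace(A) < 0:
        A, B = -A, -B
    return A, B
```

When `C` is an exact Kronecker product, the rearranged matrix has rank one and `np.kron(A, B)` reproduces `C`. When it is not, the discarded singular values are the Frobenius approximation error, which is the kind of unavoidable residual the abstract says each divergence distributes differently across the spectrum.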
Problem

Research questions and friction points this paper is trying to address.

Bregman Divergences
Kronecker-Factored Optimizers
Covariance Approximation
Spectral Analysis
Preconditioning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bregman divergence
Kronecker-factored optimizer
covariance spectrum
subspace-aware preconditioning
Shampoo-style optimizer