Fast Compute for ML Optimization

📅 2026-02-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two frictions in machine learning optimization: reliance on manual hyperparameter tuning, such as learning-rate and momentum scheduling, and slow convergence on ill-conditioned problems. The authors propose the Scale-Mixture Expectation-Maximization (SM-EM) algorithm, which exploits a variance-mean scale-mixture representation of the loss function. Within an EM framework, SM-EM infers observation and parameter weights automatically, so no user-specified learning-rate or momentum schedule is needed; the inferred weights yield adaptive scaling and weight-decay mechanisms akin to those of Adam-style optimizers, but derived from the model. Nesterov acceleration speeds empirical convergence (at the cost of EM's monotonicity guarantee), and reusing sufficient statistics across penalty values makes regularization paths cheap to compute. On ill-conditioned logistic regression benchmarks, the accelerated variant attains up to 13× lower final loss than grid-search-tuned Adam, and sharing sufficient statistics yields a 10× runtime reduction on a 40-point regularization path.

📝 Abstract
We study optimization for losses that admit a variance-mean scale-mixture representation. Under this representation, each EM iteration is a weighted least squares update in which latent variables determine observation and parameter weights; these play roles analogous to Adam's second-moment scaling and AdamW's weight decay, but are derived from the model. The resulting Scale Mixture EM (SM-EM) algorithm removes user-specified learning-rate and momentum schedules. On synthetic ill-conditioned logistic regression benchmarks with $p \in \{20, \ldots, 500\}$, SM-EM with Nesterov acceleration attains up to $13\times$ lower final loss than Adam tuned by learning-rate grid search. For a 40-point regularization path, sharing sufficient statistics across penalty values yields a $10\times$ runtime reduction relative to the same tuned-Adam protocol. For the base (non-accelerated) algorithm, EM monotonicity guarantees nonincreasing objective values; adding Nesterov extrapolation trades this guarantee for faster empirical convergence.
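The abstract's description — each EM iteration is a weighted least-squares update whose weights come from latent scale variables, with no learning rate or momentum — can be illustrated with a minimal sketch. The paper's exact SM-EM update is not reproduced here; as a stand-in under that caveat, this uses the classical Pólya–Gamma scale-mixture EM for L2-penalized logistic regression, which has the same structure: the E-step computes observation weights from the current linear predictor, and the M-step is a ridge-weighted least-squares solve. All function and variable names are illustrative.

```python
import numpy as np

def sm_em_logistic(X, y, lam=1e-2, n_iter=100, tol=1e-10):
    """Scale-mixture EM sketch for L2-penalized logistic regression.

    E-step: observation weights w_i = tanh(eta_i/2) / (2*eta_i)
            (the expected Polya-Gamma latent scale given eta_i = x_i' beta).
    M-step: ridge-weighted least squares -- no learning rate, no momentum;
            `lam` plays the role of the model-derived parameter weight.
    """
    n, p = X.shape
    beta = np.zeros(p)
    kappa = y - 0.5                      # centered response for the M-step
    for _ in range(n_iter):
        eta = X @ beta
        # E-step: weights; the limit of tanh(eta/2)/(2*eta) at eta -> 0 is 1/4
        eta_safe = np.where(np.abs(eta) < 1e-8, 1.0, eta)
        w = np.where(np.abs(eta) < 1e-8, 0.25,
                     np.tanh(eta_safe / 2.0) / (2.0 * eta_safe))
        # M-step: solve the weighted ridge normal equations exactly
        A = X.T @ (w[:, None] * X) + lam * np.eye(p)
        beta_new = np.linalg.solve(A, X.T @ kappa)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```

Because each M-step solves its subproblem exactly, the penalized objective is nonincreasing across iterations, matching the base algorithm's monotonicity guarantee; the abstract's Nesterov variant would extrapolate between such updates and forgo that guarantee.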
Problem

Research questions and friction points this paper is trying to address.

optimization
learning-rate scheduling
variance-mean scale mixture
hyperparameter tuning
EM algorithm
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scale Mixture EM
variance-mean scale mixture
learning-rate-free optimization
weighted least squares EM
Nesterov acceleration