Fast Compute for ML Optimization

📅 2026-02-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two frictions in machine learning optimization: reliance on manual hyperparameter tuning, such as learning-rate and momentum scheduling, and slow convergence on ill-conditioned problems. The authors propose the Scale-Mixture Expectation-Maximization (SM-EM) algorithm, which exploits a variance-mean scale-mixture representation of the loss function. Within an EM framework, SM-EM infers observation and parameter weights automatically, so no user-specified learning-rate or momentum schedule is needed; the inferred weights yield adaptive scaling and weight-decay mechanisms akin to those of Adam-style optimizers, but derived from the model. Nesterov acceleration speeds empirical convergence (at the cost of EM's monotonicity guarantee), and reusing sufficient statistics across penalty values makes regularization paths cheap to compute. On ill-conditioned logistic regression benchmarks, the accelerated variant attains up to 13× lower final loss than grid-search-tuned Adam, and sharing sufficient statistics yields a 10× runtime reduction on a 40-point regularization path.

📝 Abstract
We study optimization for losses that admit a variance-mean scale-mixture representation. Under this representation, each EM iteration is a weighted least squares update in which latent variables determine observation and parameter weights; these play roles analogous to Adam's second-moment scaling and AdamW's weight decay, but are derived from the model. The resulting Scale Mixture EM (SM-EM) algorithm removes user-specified learning-rate and momentum schedules. On synthetic ill-conditioned logistic regression benchmarks with $p \in \{20, \ldots, 500\}$, SM-EM with Nesterov acceleration attains up to $13\times$ lower final loss than Adam tuned by learning-rate grid search. For a 40-point regularization path, sharing sufficient statistics across penalty values yields a $10\times$ runtime reduction relative to the same tuned-Adam protocol. For the base (non-accelerated) algorithm, EM monotonicity guarantees nonincreasing objective values; adding Nesterov extrapolation trades this guarantee for faster empirical convergence.
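The abstract's description — each EM iteration is a weighted least-squares update whose weights come from latent scale variables, with no learning rate or momentum — can be illustrated with a minimal sketch. The paper's exact SM-EM update is not reproduced here; as a stand-in under that caveat, this uses the classical Pólya–Gamma scale-mixture EM for L2-penalized logistic regression, which has the same structure: the E-step computes observation weights from the current linear predictor, and the M-step is a ridge-weighted least-squares solve. All function and variable names are illustrative.

```python
import numpy as np

def sm_em_logistic(X, y, lam=1e-2, n_iter=100, tol=1e-10):
    """Scale-mixture EM sketch for L2-penalized logistic regression.

    E-step: observation weights w_i = tanh(eta_i/2) / (2*eta_i)
            (the expected Polya-Gamma latent scale given eta_i = x_i' beta).
    M-step: ridge-weighted least squares -- no learning rate, no momentum;
            `lam` plays the role of the model-derived parameter weight.
    """
    n, p = X.shape
    beta = np.zeros(p)
    kappa = y - 0.5                      # centered response for the M-step
    for _ in range(n_iter):
        eta = X @ beta
        # E-step: weights; the limit of tanh(eta/2)/(2*eta) at eta -> 0 is 1/4
        eta_safe = np.where(np.abs(eta) < 1e-8, 1.0, eta)
        w = np.where(np.abs(eta) < 1e-8, 0.25,
                     np.tanh(eta_safe / 2.0) / (2.0 * eta_safe))
        # M-step: solve the weighted ridge normal equations exactly
        A = X.T @ (w[:, None] * X) + lam * np.eye(p)
        beta_new = np.linalg.solve(A, X.T @ kappa)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```

Because each M-step solves its subproblem exactly, the penalized objective is nonincreasing across iterations, matching the base algorithm's monotonicity guarantee; the abstract's Nesterov variant would extrapolate between such updates and forgo that guarantee.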
Problem

Research questions and friction points this paper is trying to address.

optimization
learning-rate scheduling
variance-mean scale mixture
hyperparameter tuning
EM algorithm
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scale Mixture EM
variance-mean scale mixture
learning-rate-free optimization
weighted least squares EM
Nesterov acceleration