🤖 AI Summary
To address the trade-off between the high computational cost of first-order methods and the slow high-dimensional convergence of zeroth-order methods in large-scale nonconvex optimization, this paper proposes VAMO (VAriance-reduced Mixed-gradient Optimizer). VAMO introduces the first SVRG-style framework that integrates first-order and zeroth-order information: it combines mini-batch stochastic gradients with adaptive multi-point zeroth-order finite-difference gradient estimates to achieve variance reduction. Theoretically, VAMO attains a dimension-independent convergence rate of $\mathcal{O}(1/T + 1/b)$, where $T$ is the iteration count and $b$ is the mini-batch size. Crucially, it dynamically adjusts the number of zeroth-order sampling points to balance estimation accuracy against computational overhead. Experiments on neural network training and large language model fine-tuning demonstrate that VAMO outperforms SGD, SVRG, and pure zeroth-order baselines, achieving faster convergence, lower communication and computational costs, and better generalization.
📝 Abstract
Optimizing large-scale nonconvex problems, common in machine learning, demands balancing rapid convergence with computational efficiency. First-order (FO) stochastic methods like SVRG provide fast convergence and good generalization but incur high costs due to full-batch gradients in large models. Conversely, zeroth-order (ZO) algorithms reduce this burden using estimated gradients, yet their slow convergence in high-dimensional settings limits practicality. We introduce VAMO (VAriance-reduced Mixed-gradient Optimizer), a stochastic variance-reduced method combining FO mini-batch gradients with lightweight ZO finite-difference probes under an SVRG-style framework. VAMO's hybrid design uses a two-point ZO estimator to achieve a dimension-agnostic convergence rate of $\mathcal{O}(1/T + 1/b)$, where $T$ is the number of iterations and $b$ is the batch size, surpassing the dimension-dependent slowdown of purely ZO methods and significantly improving over SGD's $\mathcal{O}(1/\sqrt{T})$ rate. Additionally, we propose a multi-point ZO variant that mitigates the $\mathcal{O}(1/b)$ error by adjusting the number of estimation points to balance convergence and cost, making it well suited to a range of computationally constrained scenarios. Experiments spanning traditional neural network training and LLM fine-tuning show that VAMO outperforms established FO and ZO methods, offering a faster, more flexible option for improved efficiency.
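The hybrid update the abstract describes can be illustrated with a small sketch on a toy least-squares problem. This is an assumed reading of the design, not the paper's exact algorithm: the SVRG control variate uses FO mini-batch gradients, while the full-batch anchor gradient at each snapshot is replaced by a multi-point two-sided ZO finite-difference estimate. All names (`vamo_sketch`, `zo_full_grad`) and hyperparameters are hypothetical.

```python
import numpy as np

# Toy least-squares objective: f(x) = (1/n) * sum_i 0.5 * (a_i^T x - y_i)^2
rng = np.random.default_rng(0)
n, d = 64, 10
A = rng.standard_normal((n, d))
y = A @ rng.standard_normal(d)

def loss(x, idx=None):
    idx = np.arange(n) if idx is None else idx
    r = A[idx] @ x - y[idx]
    return 0.5 * float(np.mean(r ** 2))

def fo_grad(x, idx):
    # First-order mini-batch gradient on the sampled indices.
    r = A[idx] @ x - y[idx]
    return A[idx].T @ r / len(idx)

def zo_full_grad(x, mu=1e-4, q=20):
    # Multi-point two-sided finite-difference gradient estimator:
    # average of d * (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u over q random
    # unit directions u. Larger q spends more function evaluations to get a
    # lower-variance estimate (the accuracy/cost knob the paper highlights).
    g = np.zeros(d)
    for _ in range(q):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)
        g += d * (loss(x + mu * u) - loss(x - mu * u)) / (2 * mu) * u
    return g / q

def vamo_sketch(x0, epochs=20, inner=64, b=8, lr=0.05, q=20):
    x = x0.copy()
    for _ in range(epochs):
        snap = x.copy()
        # ZO stand-in for SVRG's expensive full-batch snapshot gradient.
        anchor = zo_full_grad(snap, q=q)
        for _ in range(inner):
            idx = rng.choice(n, size=b, replace=False)
            # SVRG-style control variate built from FO mini-batch gradients.
            v = fo_grad(x, idx) - fo_grad(snap, idx) + anchor
            x = x - lr * v
    return x

x_final = vamo_sketch(np.zeros(d))
```

The intuition matches the abstract: the inner FO difference term keeps the variance-reduced structure, while the ZO anchor avoids ever computing an exact full-batch gradient, and `q` trades estimation accuracy for per-snapshot cost.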