VAMO: Efficient Large-Scale Nonconvex Optimization via Adaptive Zeroth Order Variance Reduction

📅 2025-05-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the trade-off between the high computational cost of first-order methods and the slow high-dimensional convergence of zeroth-order methods in large-scale nonconvex optimization, this paper proposes VAMO (VAriance-reduced Mixed-gradient Optimizer). VAMO introduces the first SVRG-style framework that synergistically integrates first-order and zeroth-order information: it combines mini-batch stochastic gradients with adaptive multi-point zeroth-order finite-difference gradient estimates to achieve variance reduction. Theoretically, VAMO attains a dimension-independent convergence rate of $\mathcal{O}(1/T + 1/b)$, where $T$ is the iteration count and $b$ is the mini-batch size. Crucially, it dynamically adjusts the number of zeroth-order sampling points to balance estimation accuracy against computational overhead. Experiments on neural network training and large language model fine-tuning demonstrate that VAMO outperforms SGD, SVRG, and purely zeroth-order baselines, achieving faster convergence at lower computational cost with good generalization.

📝 Abstract
Optimizing large-scale nonconvex problems, common in machine learning, demands balancing rapid convergence with computational efficiency. First-order (FO) stochastic methods like SVRG provide fast convergence and good generalization but incur high costs due to full-batch gradients in large models. Conversely, zeroth-order (ZO) algorithms reduce this burden using estimated gradients, yet their slow convergence in high-dimensional settings limits practicality. We introduce VAMO (VAriance-reduced Mixed-gradient Optimizer), a stochastic variance-reduced method combining FO mini-batch gradients with lightweight ZO finite-difference probes under an SVRG-style framework. VAMO's hybrid design uses a two-point ZO estimator to achieve a dimension-agnostic convergence rate of $\mathcal{O}(1/T + 1/b)$, where $T$ is the number of iterations and $b$ is the batch size, surpassing the dimension-dependent slowdown of purely ZO methods and significantly improving over SGD's $\mathcal{O}(1/\sqrt{T})$ rate. Additionally, we propose a multi-point ZO variant that mitigates the $\mathcal{O}(1/b)$ error by adjusting the number of estimation points to balance convergence and cost, making it well suited to a range of computationally constrained scenarios. Experiments on traditional neural network training and LLM fine-tuning show VAMO outperforms established FO and ZO methods, offering a faster, more flexible option for improved efficiency.
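The abstract describes a two-point ZO finite-difference probe combined with FO mini-batch gradients in an SVRG-style update. A minimal sketch of both pieces, assuming a Gaussian random direction for the probe and assuming the ZO estimate replaces the snapshot's expensive full-batch gradient (the exact placement of the ZO probe, and all names and hyperparameters here, are illustrative assumptions, not the paper's precise update rule):

```python
import numpy as np

def zo_two_point_grad(f, x, mu=1e-3, rng=None):
    """Two-point zeroth-order gradient estimate of f at x.

    Probes f along one random Gaussian direction u and uses the
    finite difference (f(x + mu*u) - f(x - mu*u)) / (2*mu) as a
    directional-derivative estimate, then scales u by it.
    The Gaussian direction and smoothing radius mu are
    illustrative choices.
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal(x.shape)
    fd = (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu)
    return fd * u

def hybrid_svrg_step(x, snapshot, grad_batch, f_full, lr=0.1, rng=None):
    """One SVRG-style step with a hybrid control variate:
    FO mini-batch gradients at x and at the snapshot, plus a
    cheap ZO probe standing in for the snapshot's full-batch
    gradient. How FO and ZO terms are combined is an assumption
    for illustration only.
    """
    v = (grad_batch(x) - grad_batch(snapshot)
         + zo_two_point_grad(f_full, snapshot, rng=rng))
    return x - lr * v
```

On a quadratic, the two-point probe is an unbiased estimate of the true gradient, so averaging many probes recovers it; a single probe costs only two function evaluations regardless of dimension, which is the appeal in large models.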
Problem

Research questions and friction points this paper is trying to address.

Balancing convergence and efficiency in large-scale nonconvex optimization
Reducing computational costs of gradient estimation in high dimensions
Improving optimization speed for machine learning models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid FO and ZO gradients for efficiency
Two-point ZO estimator for fast convergence
Multi-point ZO variant balances cost and accuracy
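The multi-point variant named in the bullets above can be sketched as averaging several independent two-point probes: variance falls roughly as 1/q while the cost grows as 2q function evaluations, which is the cost-accuracy knob the paper describes. The function name and defaults below are assumptions for illustration, not the paper's API:

```python
import numpy as np

def zo_multipoint_grad(f, x, q=4, mu=1e-3, rng=None):
    """Average q independent two-point ZO probes at x.

    Each probe draws a fresh Gaussian direction; increasing q
    reduces estimator variance (roughly proportional to 1/q) at
    the price of 2q function evaluations. q and mu are
    illustrative knobs, not the paper's settings.
    """
    rng = np.random.default_rng() if rng is None else rng
    est = np.zeros_like(x)
    for _ in range(q):
        u = rng.standard_normal(x.shape)
        est += (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u
    return est / q
```

Setting q=1 recovers the two-point estimator; larger q trades extra forward passes for a lower-variance (and hence lower error-floor) gradient estimate.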
Jiahe Chen
Department of Computer Science, City University of Hong Kong
Ziye Ma
Assistant Professor, CS, City University of Hong Kong
Optimization · Machine Learning · Estimation