🤖 AI Summary
To address the trade-off between the high computational cost of first-order methods and the slow high-dimensional convergence of zeroth-order methods in large-scale nonconvex optimization, this paper proposes VAMO (VAriance-reduced Mixed-gradient Optimizer). VAMO introduces the first SVRG-style framework that integrates first-order and zeroth-order information: it combines mini-batch stochastic gradients with adaptive multi-point zeroth-order finite-difference gradient estimates to achieve variance reduction. Theoretically, VAMO attains a dimension-independent convergence rate of $\mathcal{O}(1/T + 1/b)$, where $T$ is the iteration count and $b$ is the mini-batch size. Crucially, it dynamically adjusts the number of zeroth-order sampling points to balance estimation accuracy against computational overhead. Experiments on neural network training and large language model fine-tuning demonstrate that VAMO outperforms SGD, SVRG, and pure zeroth-order baselines, achieving faster convergence, lower communication and computational costs, and better generalization.
📝 Abstract
Optimizing large-scale nonconvex problems, common in machine learning, demands balancing rapid convergence with computational efficiency. First-order (FO) stochastic methods like SVRG provide fast convergence and good generalization but incur high costs due to full-batch gradients in large models. Conversely, zeroth-order (ZO) algorithms reduce this burden using estimated gradients, yet their slow convergence in high-dimensional settings limits practicality. We introduce VAMO (VAriance-reduced Mixed-gradient Optimizer), a stochastic variance-reduced method combining FO mini-batch gradients with lightweight ZO finite-difference probes under an SVRG-style framework. VAMO's hybrid design uses a two-point ZO estimator to achieve a dimension-agnostic convergence rate of $\mathcal{O}(1/T + 1/b)$, where $T$ is the number of iterations and $b$ is the batch size, surpassing the dimension-dependent slowdown of purely ZO methods and significantly improving over SGD's $\mathcal{O}(1/\sqrt{T})$ rate. Additionally, we propose a multi-point ZO variant that mitigates the $\mathcal{O}(1/b)$ error by adjusting the number of estimation points to balance convergence and cost, making it well suited to a range of computationally constrained scenarios. Experiments spanning traditional neural network training and LLM fine-tuning show that VAMO outperforms established FO and ZO methods, offering a faster, more flexible option for improved efficiency.
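The hybrid update the abstract describes can be illustrated with a small sketch on a toy least-squares problem. This is an assumed reading of the design, not the paper's exact algorithm: the SVRG control variate uses FO mini-batch gradients, while the full-batch anchor gradient at each snapshot is replaced by a multi-point two-sided ZO finite-difference estimate. All names (`vamo_sketch`, `zo_full_grad`) and hyperparameters are hypothetical.

```python
import numpy as np

# Toy least-squares objective: f(x) = (1/n) * sum_i 0.5 * (a_i^T x - y_i)^2
rng = np.random.default_rng(0)
n, d = 64, 10
A = rng.standard_normal((n, d))
y = A @ rng.standard_normal(d)

def loss(x, idx=None):
    idx = np.arange(n) if idx is None else idx
    r = A[idx] @ x - y[idx]
    return 0.5 * float(np.mean(r ** 2))

def fo_grad(x, idx):
    # First-order mini-batch gradient on the sampled indices.
    r = A[idx] @ x - y[idx]
    return A[idx].T @ r / len(idx)

def zo_full_grad(x, mu=1e-4, q=20):
    # Multi-point two-sided finite-difference gradient estimator:
    # average of d * (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u over q random
    # unit directions u. Larger q spends more function evaluations to get a
    # lower-variance estimate (the accuracy/cost knob the paper highlights).
    g = np.zeros(d)
    for _ in range(q):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)
        g += d * (loss(x + mu * u) - loss(x - mu * u)) / (2 * mu) * u
    return g / q

def vamo_sketch(x0, epochs=20, inner=64, b=8, lr=0.05, q=20):
    x = x0.copy()
    for _ in range(epochs):
        snap = x.copy()
        # ZO stand-in for SVRG's expensive full-batch snapshot gradient.
        anchor = zo_full_grad(snap, q=q)
        for _ in range(inner):
            idx = rng.choice(n, size=b, replace=False)
            # SVRG-style control variate built from FO mini-batch gradients.
            v = fo_grad(x, idx) - fo_grad(snap, idx) + anchor
            x = x - lr * v
    return x

x_final = vamo_sketch(np.zeros(d))
```

The intuition matches the abstract: the inner FO difference term keeps the variance-reduced structure, while the ZO anchor avoids ever computing an exact full-batch gradient, and `q` trades estimation accuracy for per-snapshot cost.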