🤖 AI Summary
First-order methods such as SGD converge only to stationary points, while second-order methods, though potentially faster, are hindered by the prohibitive cost of Hessian computation in large-scale deep learning. To address this, we propose FUSE, the first unified theoretical framework for stochastic first- and second-order optimization. Its core is FUSE-PV, a provably efficient algorithm that dynamically integrates gradient and curvature information via an adaptive switching mechanism. Theoretically, FUSE-PV achieves better iteration complexity than both SGD and Adam. Empirically, it shows markedly faster convergence at reduced per-iteration cost on standard test functions as well as on image and language benchmarks (e.g., CIFAR-10/100, ImageNet, WikiText-2). FUSE bridges theoretical rigor and practical scalability, offering a new paradigm for stochastic optimization in modern deep learning.
📝 Abstract
Stochastic optimization methods play a critical role in delivering strong performance in modern machine learning. While numerous works have proposed diverse approaches, first-order and second-order methods occupy very different positions. The former dominate modern deep learning but guarantee convergence only to a stationary point, whereas second-order methods remain less popular due to their computational cost on high-dimensional problems. This paper presents a method that leverages both first- and second-order updates in a unified algorithmic framework, termed FUSE, from which a practical version (FUSE-PV) is derived. FUSE-PV is a simple yet efficient optimization method that switches over from first- to second-order updates, and we develop several criteria for deciding when to switch. FUSE-PV provably attains lower computational complexity than SGD and Adam. To validate the proposed scheme, we present an ablation study on several simple test functions and compare against baselines on benchmark datasets.
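To make the switch-over idea concrete, here is a minimal toy sketch of a first-to-second-order switching loop on a deterministic quadratic. The switching criterion used below (a gradient-norm threshold), the damping constant, and all function names are illustrative assumptions for exposition; the abstract does not specify FUSE-PV's actual criteria, and the real algorithm operates on stochastic gradients.

```python
import numpy as np

def switching_descent(grad, hess, x0, lr=0.1, tol=1e-3,
                      switch_tol=1e-1, max_iter=1000):
    """Toy first-/second-order switch-over (NOT the paper's FUSE-PV).

    Runs plain gradient descent and, once the gradient norm drops below
    `switch_tol` (a hypothetical switching criterion), takes damped
    Newton steps that use curvature information.
    """
    x = np.asarray(x0, dtype=float)
    switched = False
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:       # converged
            break
        if np.linalg.norm(g) < switch_tol:
            # Second-order phase: damped Newton step
            H = hess(x)
            x = x - np.linalg.solve(H + 1e-8 * np.eye(len(x)), g)
            switched = True
        else:
            # First-order phase: plain gradient step
            x = x - lr * g
    return x, switched

# Toy objective f(x) = 0.5 * x^T A x, minimized at the origin
A = np.array([[3.0, 0.5], [0.5, 1.0]])
grad_f = lambda x: A @ x
hess_f = lambda x: A
x_star, switched = switching_descent(grad_f, hess_f, np.array([2.0, -1.5]))
```

On this quadratic the loop first contracts the gradient with cheap first-order steps, then a single Newton step lands essentially at the minimizer, mirroring the intended benefit of spending curvature computation only where it pays off.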