🤖 AI Summary
First-order methods such as SGD converge only to stationary points, while second-order methods, though potentially faster, are hindered by the prohibitive cost of Hessian computation in large-scale deep learning. To address this, we propose FUSE, the first unified theoretical framework for stochastic first- and second-order optimization. Its core is FUSE-PV, a provably efficient algorithm that dynamically integrates gradient and curvature information via an adaptive switching mechanism. Theoretically, FUSE-PV achieves better iteration complexity than both SGD and Adam. Empirically, it shows markedly faster convergence at reduced per-iteration cost on standard test functions as well as on image and language benchmarks (e.g., CIFAR-10/100, ImageNet, WikiText-2). FUSE bridges theoretical rigor and practical scalability, offering a new paradigm for stochastic optimization in modern deep learning.
📝 Abstract
Stochastic optimization methods play a critical role in delivering strong performance in modern machine learning. While numerous works have proposed diverse approaches, first-order and second-order methods occupy very different positions. The former dominate modern deep learning but guarantee convergence only to a stationary point, whereas second-order methods remain less popular due to their computational cost on high-dimensional problems. This paper presents a method that leverages both first- and second-order updates in a unified algorithmic framework, termed FUSE, from which a practical version (FUSE-PV) is derived. FUSE-PV is a simple yet efficient optimization method that switches over from first- to second-order updates, and we develop several criteria for deciding when to switch. FUSE-PV provably attains lower computational complexity than SGD and Adam. To validate the proposed scheme, we present an ablation study on several simple test functions and compare against baselines on benchmark datasets.
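To make the switch-over idea concrete, here is a minimal toy sketch of a first-to-second-order switching loop on a deterministic quadratic. The switching criterion used below (a gradient-norm threshold), the damping constant, and all function names are illustrative assumptions for exposition; the abstract does not specify FUSE-PV's actual criteria, and the real algorithm operates on stochastic gradients.

```python
import numpy as np

def switching_descent(grad, hess, x0, lr=0.1, tol=1e-3,
                      switch_tol=1e-1, max_iter=1000):
    """Toy first-/second-order switch-over (NOT the paper's FUSE-PV).

    Runs plain gradient descent and, once the gradient norm drops below
    `switch_tol` (a hypothetical switching criterion), takes damped
    Newton steps that use curvature information.
    """
    x = np.asarray(x0, dtype=float)
    switched = False
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:       # converged
            break
        if np.linalg.norm(g) < switch_tol:
            # Second-order phase: damped Newton step
            H = hess(x)
            x = x - np.linalg.solve(H + 1e-8 * np.eye(len(x)), g)
            switched = True
        else:
            # First-order phase: plain gradient step
            x = x - lr * g
    return x, switched

# Toy objective f(x) = 0.5 * x^T A x, minimized at the origin
A = np.array([[3.0, 0.5], [0.5, 1.0]])
grad_f = lambda x: A @ x
hess_f = lambda x: A
x_star, switched = switching_descent(grad_f, hess_f, np.array([2.0, -1.5]))
```

On this quadratic the loop first contracts the gradient with cheap first-order steps, then a single Newton step lands essentially at the minimizer, mirroring the intended benefit of spending curvature computation only where it pays off.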