Regularized Gradient Clipping Provably Trains Wide and Deep Neural Networks

📅 2024-04-12
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This paper addresses the global convergence challenge of wide and deep neural networks under the squared loss. The authors propose δ-GClip, a regularized gradient clipping algorithm with theoretical guarantees; it is presented as the first gradient clipping variant that provably attains global convergence for sufficiently wide networks at any depth. Key contributions include: (1) the design of the first gradient clipping variant with rigorous convergence guarantees; (2) an adaptive step-size schedule grounded in the PL* condition, avoiding conventional constraints on network depth; and (3) an integration of NTK neighborhood analysis with regularized gradient clipping. The paper establishes a formal proof of convergence to a global minimum. Empirical evaluations show that δ-GClip is competitive with state-of-the-art heuristic optimizers across CNNs, MLPs, and Transformer architectures, without requiring architectural modifications or hyperparameter tuning beyond standard practice.

📝 Abstract
We present and analyze a novel regularized form of the gradient clipping algorithm, proving that it converges to global minima of the loss surface of deep neural networks under the squared loss, provided that the layers are of sufficient width. The algorithm presented here, dubbed $\delta$-GClip, introduces a modification to gradient clipping that leads to a first-of-its-kind example of a step size schedule for gradient descent that provably minimizes training losses of deep neural nets. We also present empirical evidence that our theoretically founded $\delta$-GClip algorithm is competitive with state-of-the-art deep learning heuristics on various neural architectures, including modern transformer-based architectures. The modification we make to standard gradient clipping is designed to leverage the PL* condition, a variant of the Polyak-Łojasiewicz inequality which was recently proven to hold for sufficiently wide neural networks of any depth within a neighbourhood of the initialization.
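To make the modification concrete, here is a minimal sketch of a δ-regularized clipped gradient step. The exact functional form used in the paper is not reproduced above, so the scaling rule below (standard clipping's factor min(1, γ/‖g‖), floored at δ > 0 so the effective step size never collapses to zero) is an assumption for illustration; the parameter names `eta`, `gamma`, and `delta` are likewise illustrative.

```python
import math

def delta_gclip_step(w, grad, eta=0.1, gamma=1.0, delta=0.05):
    """One hypothetical delta-regularized gradient clipping step.

    Standard gradient clipping scales the step by min(1, gamma / ||grad||).
    The assumed regularization floors that factor at delta > 0, so the
    step size is bounded away from zero even for huge gradient norms --
    the property that a PL*-style convergence analysis can exploit.
    """
    g_norm = math.sqrt(sum(g * g for g in grad))
    if g_norm == 0.0:
        return list(w)  # already at a stationary point
    h = max(delta, min(1.0, gamma / g_norm))  # regularized clipping factor
    return [wi - eta * h * gi for wi, gi in zip(w, grad)]
```

On a simple quadratic loss such as ‖w‖², iterating this update drives the parameters to the global minimum; unlike plain clipping, the δ floor guarantees a minimum amount of progress per step when gradients are large.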
Problem

Research questions and friction points this paper is trying to address.

Proves convergence to global minima for wide deep networks
Introduces δ-GClip for provable training loss minimization
Competes with state-of-the-art heuristics on transformer architectures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Regularized gradient clipping for convergence
Step size scheduling for deep networks
Leverages PL* condition for wide networks