Generalization in Deep Neural Networks: Minimax Rates for Gradient Methods

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited theoretical understanding of how over-parameterized deep neural networks (DNNs) generalize in regression tasks. It connects the learning dynamics of DNNs with smooth activation functions, trained by gradient descent (GD) or stochastic gradient descent (SGD), to those of kernel methods, tying the analysis to the neural tangent kernel (NTK). Using nonparametric statistical learning theory, the authors prove that, when the network width grows polynomially with the sample size, both GD and SGD attain the minimax-optimal rate for the excess population risk. The result shows that sufficiently wide DNNs trained by GD or SGD can generalize comparably to kernel methods.
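
To make the NTK connection concrete, here is a minimal sketch of the empirical NTK Gram matrix, the inner product of parameter gradients at two inputs. It is an illustration only and not the paper's construction: the two-layer tanh network, the 1/sqrt(m) scaling, the Gaussian initialization, and the name `empirical_ntk` are all assumptions chosen for brevity, whereas the paper treats deep networks.

```python
import numpy as np

def empirical_ntk(X, W, a):
    """Empirical NTK Gram matrix of f(x) = a^T tanh(W x) / sqrt(m).

    Hypothetical two-layer example: X is (n, d), W is (m, d), a is (m,).
    Entry (i, k) is the inner product of the parameter gradients of f
    at X[i] and X[k].
    """
    m = W.shape[0]
    act = np.tanh(X @ W.T)        # (n, m) hidden activations
    dact = 1.0 - act**2           # derivative of tanh at the pre-activations
    # Output weights: grad_a f(x) = tanh(W x) / sqrt(m)
    k_a = act @ act.T / m
    # Hidden weights: grad_{w_j} f(x) = a_j * tanh'(w_j . x) * x / sqrt(m)
    s = dact * a                  # (n, m)
    k_w = (s @ s.T) * (X @ X.T) / m
    return k_a + k_w

# Assumed toy setup: 8 inputs in 3 dimensions, width 4096.
rng = np.random.default_rng(0)
n, d, m = 8, 3, 4096
X = rng.normal(size=(n, d))
W = rng.normal(size=(m, d))
a = rng.normal(size=m)
K = empirical_ntk(X, W, a)
```

The resulting `K` is a symmetric positive semidefinite matrix, so it can be used wherever a kernel Gram matrix is expected.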

📝 Abstract

Understanding the generalization performance of over-parameterized neural networks has become a central topic in deep learning theory. While recent advances, particularly works under the Neural Tangent Kernel (NTK) regime, have shed light on the behavior of shallow architectures, the statistical generalization properties of deep neural networks (DNNs), especially in regression tasks, remain far less understood. In this paper, we make significant progress toward closing this gap by providing a comprehensive generalization analysis of DNNs trained using gradient-based methods. First, we establish, for the first time, a crucial connection between the learning dynamics of a DNN with smooth activation functions trained via gradient-based methods and those of kernel methods, showing that gradient-based methods on over-parameterized DNNs can fully inherit the favorable learning dynamics of their kernel counterparts. Building on this connection and the well-established optimality of kernel methods, we derive the first known minimax-optimal rates for the excess population risk of both gradient descent (GD) and stochastic gradient descent (SGD), under the assumption that network width scales polynomially with the sample size. Our results demonstrate that, with sufficient width, DNNs trained by GD or SGD can achieve generalization performance comparable to kernel-based methods.
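
As a rough illustration of the kernel-side learning dynamics the abstract refers to, the sketch below runs gradient descent on the squared loss for a kernel predictor f(x) = sum_i alpha_i k(x_i, x). Everything in it is an assumption made for the example: the RBF kernel stands in for an NTK, and the synthetic sine data, the step size, and the names `rbf_kernel` and `kernel_gd` do not come from the paper.

```python
import numpy as np

def rbf_kernel(A, B, bandwidth=0.5):
    """Gaussian (RBF) kernel matrix between the rows of A and B (stand-in kernel)."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * bandwidth**2))

def kernel_gd(K, y, step, num_steps):
    """Gradient descent in function space on the loss (1/2n) * ||K alpha - y||^2.

    The update alpha <- alpha - (step / n) * (K alpha - y) makes the training
    residual evolve linearly: r_{t+1} = (I - (step / n) K) r_t.
    """
    n = len(y)
    alpha = np.zeros(n)
    train_mse = []
    for _ in range(num_steps):
        residual = K @ alpha - y
        train_mse.append(float(np.mean(residual**2)))  # MSE before this update
        alpha -= (step / n) * residual
    return alpha, train_mse

# Assumed synthetic regression problem: noisy samples of sin(x).
rng = np.random.default_rng(0)
n = 50
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

K = rbf_kernel(X, X)
step = n / np.linalg.eigvalsh(K).max()  # keeps every eigen-direction contractive
alpha, train_mse = kernel_gd(K, y, step, num_steps=200)
```

Because the residual evolves through a fixed linear map determined by the Gram matrix, the trajectory can be analyzed through the kernel's spectrum. That tractability is what makes a kernel description of network training useful for this kind of analysis.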

Problem

Research questions and friction points this paper is trying to address.

generalization

deep neural networks

gradient methods

minimax rates

over-parameterization

Innovation

Methods, ideas, or system contributions that make the work stand out.

minimax optimality

generalization bounds

Neural Tangent Kernel