🤖 AI Summary
This work addresses the limited theoretical understanding of why Sobolev training accelerates convergence in ReLU networks. We establish the first rigorous theoretical framework under the student-teacher setting with Gaussian inputs, deriving exact expressions for the gradient and Hessian of the Sobolev loss. Our analysis reveals that incorporating target derivatives fundamentally improves the condition number of the loss landscape and enhances gradient signal quality, thereby accelerating gradient-flow convergence. Crucially, we identify smoothness regularization and enriched gradient information as the core mechanisms underlying this acceleration. Extensive numerical experiments corroborate the theory: Sobolev training consistently improves both convergence speed and generalization across shallow and deep ReLU networks. This study provides the first interpretable theoretical foundation for Sobolev training and opens new avenues for designing efficient deep learning optimization methods grounded in functional-space regularization.
📝 Abstract
Sobolev training, which integrates target derivatives into the loss function, has been shown to accelerate convergence and improve generalization compared to conventional $L^2$ training. However, the underlying mechanisms of this training method remain only partially understood. In this work, we present the first rigorous theoretical framework proving that Sobolev training accelerates the convergence of Rectified Linear Unit (ReLU) networks. Under a student-teacher framework with Gaussian inputs and shallow architectures, we derive exact formulas for population gradients and Hessians, and quantify the improvements in the conditioning of the loss landscape and in gradient-flow convergence rates. Extensive numerical experiments validate our theoretical findings and show that the benefits of Sobolev training extend to modern deep learning tasks.
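To make the loss construction concrete, here is a minimal sketch of a Sobolev-type loss in one dimension. This is an illustrative setup, not the paper's exact formulation: the function names, the finite sample lists, and the weighting parameter `lam` are all hypothetical, and the loss simply augments the usual squared error with a squared-error term on derivatives.

```python
def l2_loss(pred, target):
    """Plain L^2 (mean squared error) loss over a list of samples."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def sobolev_loss(pred, target, pred_grad, target_grad, lam=1.0):
    """L^2 loss plus a term matching predicted and target derivatives.

    `lam` (hypothetical) weights the derivative-matching penalty; the
    derivative term is what distinguishes Sobolev from L^2 training.
    """
    return l2_loss(pred, target) + lam * l2_loss(pred_grad, target_grad)

# Toy example: student predictions vs. teacher values and derivatives.
pred        = [0.0, 0.5, 1.0]
target      = [0.1, 0.4, 1.1]
pred_grad   = [1.0, 1.0, 1.0]
target_grad = [0.9, 1.0, 1.2]

base = l2_loss(pred, target)                                  # value mismatch only
sob  = sobolev_loss(pred, target, pred_grad, target_grad)     # adds derivative mismatch
```

Because the derivative term is nonnegative, the Sobolev loss always upper-bounds the plain $L^2$ loss; the paper's analysis concerns how this extra term reshapes the gradient and Hessian of the population loss, which this sketch does not attempt to reproduce.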