🤖 AI Summary
This paper investigates the optimization landscape induced by shallow (single-hidden-layer) neural networks with analytic activation functions under mean-squared-error loss for regression. Specifically, it characterizes when local minima admit strongly convex neighborhoods and what this implies for the asymptotic convergence rate of first-order optimizers. Methodologically, the analysis combines differential topology and Morse theory with a stochastic model of regression problems and a geometric decomposition of the parameter space. The key contribution is the first rigorous proof that, on the efficient parameter domain (the set of parameters whose realization function cannot be produced by a network with fewer neurons), the loss is almost surely a Morse function for randomly drawn regression problems; consequently, every local minimum there has a strongly convex neighborhood, which guarantees linear local convergence of gradient-based algorithms. In contrast, on the redundant parameter domain, which has significantly smaller dimension, local minima are never isolated. These results clarify how parameter redundancy shapes optimization dynamics and provide theoretical foundations for the efficient training of shallow neural networks.
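To make the convergence claim concrete, the following is the standard mechanism being invoked (a textbook fact about smooth optimization, not the paper's proof): the Morse property makes the Hessian at every local minimum nondegenerate, hence positive definite, so the loss $\mathcal{L}$ is strongly convex and smooth on a neighborhood of the minimum, and gradient descent contracts there at a linear rate. In the sketch below, $\mu$ and $L_s$ denote the assumed local strong-convexity and smoothness constants.

```latex
% Nondegenerate (Morse) local minimum theta* of the loss L:
\nabla^{2}\mathcal{L}(\theta^{*}) \succ 0
\;\Longrightarrow\;
\exists\, 0 < \mu \le L_{s} :\;
\mu I \preceq \nabla^{2}\mathcal{L}(\theta) \preceq L_{s} I
\quad \text{for all } \theta \text{ in a neighborhood of } \theta^{*} .

% Gradient descent with step size 1/L_s then converges linearly:
\theta_{k+1} = \theta_{k} - \tfrac{1}{L_{s}} \nabla\mathcal{L}(\theta_{k}),
\qquad
\lVert \theta_{k+1} - \theta^{*} \rVert
\le \Bigl(1 - \tfrac{\mu}{L_{s}}\Bigr) \lVert \theta_{k} - \theta^{*} \rVert .
```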
📝 Abstract
Whether a local minimum of a cost function has a strongly convex neighborhood greatly influences the asymptotic convergence rate of optimizers. In this article, we rigorously analyze the prevalence of this property for the mean squared error induced by shallow one-hidden-layer neural networks with analytic activation functions when applied to regression problems. The parameter space is divided into two domains: the 'efficient domain' (all parameters for which the respective realization function cannot be generated by a network with a smaller number of neurons) and the 'redundant domain' (the remaining parameters). For almost all regression problems, the optimization landscape on the efficient domain features only local minima that have strongly convex neighborhoods. Formally, we show that for certain randomly picked regression problems the optimization landscape is almost surely a Morse function on the efficient domain. The redundant domain has significantly smaller dimension than the efficient domain, and on this domain potential local minima are never isolated.
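The non-isolation of minima on the redundant domain can be seen in a toy instance. The sketch below is hypothetical illustration code, not from the paper; the width-2 tanh network, the data, and all parameter values are invented. It places the second neuron's output weight at zero, so the remaining parameters of that neuron can be moved freely without changing the realization function, and every point of that continuum attains the same mean squared error.

```python
import numpy as np

# Width-2 shallow network: f_theta(x) = a1*tanh(w1*x + b1) + a2*tanh(w2*x + b2).
def realization(theta, x):
    a, w, b = theta
    return np.tanh(np.outer(x, w) + b) @ a

def mse(theta, x, y):
    r = realization(theta, x) - y
    return 0.5 * np.mean(r ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = np.sin(x)  # invented regression target

# Redundant parameters: a2 = 0, so (w2, b2) do not affect the realization.
# Both points below realize the same function a1*tanh(w1*x + b1).
theta_1 = (np.array([1.0, 0.0]), np.array([2.0, -1.0]), np.array([0.0, 3.0]))
theta_2 = (np.array([1.0, 0.0]), np.array([2.0, 5.0]), np.array([0.0, -7.0]))

print(mse(theta_1, x, y) == mse(theta_2, x, y))  # True: a continuum of equal-loss points
```

Because the loss is constant along the (w2, b2) directions at such parameters, no local minimum in the redundant domain can be isolated; the Morse and strong-convexity guarantees are specific to the efficient domain.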