🤖 AI Summary
In learning orthogonal multi-index models, the conventional information exponent, which relies solely on the lowest-order Hermite coefficient of the link function, predicts suboptimal sample complexity. Method: This paper introduces a fine-grained analysis that jointly leverages the second- and $2L$-th order Hermite moments in a two-stage procedure: (i) estimate the relevant subspace from the second-order terms, then (ii) recover the exact orthogonal directions from the higher-order terms, thereby overcoming the rotational invariance of second-order statistics. The approach combines Hermite polynomial expansions, online stochastic gradient descent (SGD), higher-order moment estimation, and subspace decomposition. Contribution/Results: The paper establishes identifiability of the target directions and reduces both sample and computational complexity from $d^{2L-1} \cdot \mathrm{poly}(P)$ to $d \cdot \mathrm{poly}(P)$, achieving linear-in-$d$ complexity and significantly outperforming analyses based solely on the lowest-order information exponent.
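In the notation of the abstract below, the two stages rest on standard Hermite-moment identities (a sketch, assuming Gaussian inputs and writing $\phi = \sum_j c_j \mathrm{He}_j$ in the probabilists' Hermite polynomials $\mathrm{He}_j$):

$$\mathbb{E}\big[f_*(\mathbf{x})\, \mathbf{x}\mathbf{x}^\top\big] = 2 c_2 \sum_{k=1}^{P} \mathbf{v}_k^* (\mathbf{v}_k^*)^\top,$$

whose top-$P$ eigenspace identifies $\mathrm{span}\{\mathbf{v}_k^*\}$ but is invariant under rotations within that span; whereas for a unit vector $\mathbf{u}$ in the span (and $L > 1$),

$$\mathbb{E}\big[f_*(\mathbf{x})\, \mathrm{He}_{2L}(\mathbf{u} \cdot \mathbf{x})\big] = (2L)!\, c_{2L} \sum_{k=1}^{P} (\mathbf{u} \cdot \mathbf{v}_k^*)^{2L},$$

which (for $c_{2L} > 0$) is maximized over unit $\mathbf{u}$ exactly at $\mathbf{u} = \pm \mathbf{v}_k^*$, breaking the rotational invariance.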
📝 Abstract
The information exponent (Ben Arous et al. [2021]) -- which is equivalent to the lowest degree in the Hermite expansion of the link function for Gaussian single-index models -- has played an important role in predicting the sample complexity of online stochastic gradient descent (SGD) in various learning tasks. In this work, we demonstrate that, for multi-index models, focusing solely on the lowest degree can miss key structural details of the model and result in suboptimal rates. Specifically, we consider the task of learning target functions of the form $f_*(\mathbf{x}) = \sum_{k=1}^{P} \phi(\mathbf{v}_k^* \cdot \mathbf{x})$, where $P \ll d$, the ground-truth directions $\{ \mathbf{v}_k^* \}_{k=1}^P$ are orthonormal, and only the second and $2L$-th Hermite coefficients of the link function $\phi$ can be nonzero. Based on the theory of the information exponent, when the lowest degree is $2L$, recovering the directions requires $d^{2L-1} \cdot \mathrm{poly}(P)$ samples, and when the lowest degree is $2$, only the relevant subspace (not the exact directions) can be recovered due to the rotational invariance of the second-order terms. In contrast, we show that by considering both the second- and higher-order terms, we can first learn the relevant subspace via the second-order terms, and then the exact directions via the higher-order terms, so that the overall sample and computational complexity of online SGD is $d \cdot \mathrm{poly}(P)$.
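The two-stage idea can be illustrated numerically. The sketch below is not the paper's algorithm (the paper's guarantees are for online SGD); instead, for $L = 2$ with the hypothetical link $\phi = \mathrm{He}_2 + \mathrm{He}_4$ and illustrative sizes $d = 20$, $P = 3$, it estimates the subspace by an eigendecomposition of the empirical second-order moment and then recovers directions inside the subspace with a tensor-power-style update on the fourth-order term:

```python
import numpy as np

rng = np.random.default_rng(0)
d, P, n = 20, 3, 200_000

# Ground-truth orthonormal directions v_1, ..., v_P.
V, _ = np.linalg.qr(rng.standard_normal((d, P)))

# Probabilists' Hermite polynomials; link function with only the 2nd and
# 4th Hermite coefficients nonzero (L = 2), i.e. phi = He_2 + He_4.
he2 = lambda t: t**2 - 1
he3 = lambda t: t**3 - 3 * t
he4 = lambda t: t**4 - 6 * t**2 + 3
phi = lambda t: he2(t) + he4(t)

X = rng.standard_normal((n, d))
y = phi(X @ V).sum(axis=1)          # f_*(x) = sum_k phi(v_k . x)

# Stage 1: E[y x x^T] = 2 c_2 sum_k v_k v_k^T, so the top-P eigenvectors
# of the empirical moment span the relevant subspace (determined only up
# to rotation within it).
M = (X * y[:, None]).T @ X / n
U = np.linalg.eigh(M)[1][:, -P:]    # d x P orthonormal basis (top-P)

# Stage 2: the 4th-order term breaks the rotational invariance. The
# update u <- normalize(E[y He_3(u . z) z]) amplifies the largest
# component along a rotated direction; deflation finds the rest.
Z = X @ U                           # projected inputs, approx. N(0, I_P)
found = np.zeros((P, 0))
for _ in range(P):
    u = rng.standard_normal(P)
    for _ in range(60):
        u = Z.T @ (y * he3(Z @ u)) / n
        u -= found @ (found.T @ u)  # project out found directions
        u /= np.linalg.norm(u)
    found = np.hstack([found, u[:, None]])

W = U @ found                       # recovered directions in R^d
# phi is even, so each v_k is identifiable only up to sign.
align = np.abs(W.T @ V).max(axis=1)
print(np.round(align, 3))           # each entry should be close to 1
```

The second-order step alone cannot distinguish $\{\mathbf{v}_k^*\}$ from any rotation of them within the subspace; only the stage-2 higher-order update pins down the individual directions, and only up to sign, since the link is even.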