Learning Orthogonal Multi-Index Models: A Fine-Grained Information Exponent Analysis

📅 2024-10-13
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
In learning orthogonal multi-index models, the conventional information-exponent analysis, which relies solely on the lowest-order Hermite coefficient, yields suboptimal sample complexity. Method: This paper introduces a fine-grained information-exponent analysis that jointly leverages the second- and $2L$-th order Hermite moments in a two-stage procedure: (i) estimate the relevant subspace, then (ii) recover the exact orthogonal directions, thereby overcoming the rotational invariance of the second-order terms. The approach combines Hermite polynomial expansions, online stochastic gradient descent (SGD), higher-order moment estimation, and subspace decomposition. Contribution/Results: The paper establishes identifiability of the target directions and reduces both sample and computational complexity from $d^{2L-1}\cdot\mathrm{poly}(P)$ to $d\cdot\mathrm{poly}(P)$, significantly outperforming analyses based solely on the lowest-order information exponent.

📝 Abstract
The information exponent (Ben Arous et al. [2021]) -- which is equivalent to the lowest degree in the Hermite expansion of the link function for Gaussian single-index models -- has played an important role in predicting the sample complexity of online stochastic gradient descent (SGD) in various learning tasks. In this work, we demonstrate that, for multi-index models, focusing solely on the lowest degree can miss key structural details of the model and result in suboptimal rates. Specifically, we consider the task of learning target functions of the form $f_*(\mathbf{x}) = \sum_{k=1}^{P} \phi(\mathbf{v}_k^* \cdot \mathbf{x})$, where $P \ll d$, the ground-truth directions $\{\mathbf{v}_k^*\}_{k=1}^P$ are orthonormal, and only the second and $2L$-th Hermite coefficients of the link function $\phi$ can be nonzero. Based on the theory of information exponent, when the lowest degree is $2L$, recovering the directions requires $d^{2L-1}\,\mathrm{poly}(P)$ samples, and when the lowest degree is $2$, only the relevant subspace (not the exact directions) can be recovered due to the rotational invariance of the second-order terms. In contrast, we show that by considering both second- and higher-order terms, we can first learn the relevant space via the second-order terms, and then the exact directions using the higher-order terms, and the overall sample and computational complexity of online SGD is $d\,\mathrm{poly}(P)$.
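The rotational-invariance obstruction described in the abstract can be made concrete with a short calculation. This is a sketch using Stein's identity, with normalization constants assumed rather than taken from the paper:

```latex
% Second-order term: for y = \sum_k \phi(\mathbf{v}_k^* \cdot \mathbf{x}) with
% Gaussian x and second Hermite coefficient c_2 \neq 0, Stein's identity gives
\mathbb{E}\!\left[y\,(\mathbf{x}\mathbf{x}^\top - \mathbf{I})\right]
  \;\propto\; \sum_{k=1}^{P} \mathbf{v}_k^* (\mathbf{v}_k^*)^\top
  \;=\; V^* (V^*)^\top .
% With orthonormal directions this is the projector onto span\{\mathbf{v}_k^*\},
% and it is invariant under any rotation V^* \mapsto V^* R (R orthogonal):
% the second-order moment identifies the subspace but not individual directions.
% A 2L-th order correlation such as
\mathbb{E}\!\left[y\,\mathrm{He}_{2L}(\mathbf{u}\cdot\mathbf{x})\right]
  \;\propto\; \sum_{k=1}^{P} (\mathbf{u}\cdot\mathbf{v}_k^*)^{2L}
% is not rotation-invariant; its maximizers over unit u are the \mathbf{v}_k^*
% themselves, which is what the second stage exploits.
```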
Problem

Research questions and friction points this paper is trying to address.

Analyzing multi-index models beyond lowest-degree information exponents
Improving sample complexity bounds for orthogonal target functions
Developing fine-grained analysis combining second- and higher-order terms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses second-order terms to learn relevant subspace
Employs higher-order terms for exact direction recovery
Reduces overall sample and computational complexity to d · poly(P)
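The two-stage recipe in the bullets above can be sketched end to end on synthetic data. The snippet below is a minimal illustration, not the paper's algorithm: it uses a one-shot eigendecomposition of the second-order moment matrix for Stage 1 and an empirical power-iteration on the fourth-order Hermite correlation (the L = 2 case) for Stage 2, in place of online SGD; all names, sizes, and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, P, n = 15, 2, 100_000       # ambient dimension, number of directions, samples

# Probabilists' Hermite polynomials He2, He4 and the derivative He4'.
He2 = lambda t: t**2 - 1
He4 = lambda t: t**4 - 6 * t**2 + 3
dHe4 = lambda t: 4 * t**3 - 12 * t

# Synthetic orthogonal multi-index model with link phi = He2 + He4,
# so only the 2nd and 4th Hermite coefficients are nonzero (L = 2).
V = np.linalg.qr(rng.standard_normal((d, P)))[0]   # orthonormal targets v_1..v_P
X = rng.standard_normal((n, d))
y = (He2(X @ V) + He4(X @ V)).sum(axis=1)

# Stage 1: subspace recovery from the second-order moment. By Stein's
# identity E[y (x x^T - I)] is proportional to sum_k v_k v_k^T, so the
# top-P eigenvectors of the empirical matrix span the relevant subspace.
M = (X.T * y) @ X / n - y.mean() * np.eye(d)
U = np.linalg.eigh(M)[1][:, -P:]                   # d x P orthonormal basis

# Stage 2: exact directions from the 4th-order term, inside the subspace.
# Power-iteration-style ascent on u -> mean(y * He4(u . U^T x)) over the
# unit sphere, with deflation so each run finds a new direction.
Zp = X @ U                                         # data projected to R^P
recovered = []
for _ in range(P):
    Q = np.eye(P) - sum(np.outer(w, w) for w in recovered)  # deflation projector
    u = Q @ rng.standard_normal(P)
    u /= np.linalg.norm(u)
    for _ in range(100):
        u = Q @ (Zp.T @ (y * dHe4(Zp @ u))) / n    # empirical gradient direction
        u /= np.linalg.norm(u)
    recovered.append(u)

# Each recovered direction should match some U^T v_k up to sign.
W = U.T @ V                                        # true directions in subspace coords
align = np.abs(np.array(recovered) @ W)            # P x P alignment matrix
print(np.round(align, 2))
```

Note the role of deflation: because the ground-truth directions are orthonormal, projecting each new run onto the orthogonal complement of the directions already found is enough to avoid recovering the same maximizer twice.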