Asymptotics of SGD in Sequence-Single Index Models and Single-Layer Attention Networks

📅 2025-06-03
📈 Citations: 0
✹ Influential: 0
đŸ€– AI Summary
This paper investigates the asymptotic dynamics of stochastic gradient descent (SGD) in the sequence single-index (SSI) model and a simplified single-layer attention network, aiming to elucidate how sequential structure, particularly positional encoding and semantic alignment, enhances attention-based learning. Methodologically, the authors derive a closed-form expression for the population loss of the SSI model, introduce a pair of sufficient statistics that jointly capture semantic and positional information, and combine high-dimensional stochastic optimization theory with asymptotic statistical analysis to characterize SGD trajectories analytically. The key contributions are threefold: (1) a two-phase convergence mechanism, namely an initial escape from uninformative initialization followed by alignment onto the target subspace; (2) a quantification of how sequence length and positional encoding accelerate convergence; (3) a rigorous, interpretable theoretical account of attention's efficacy, showing that sequential structure improves learning efficiency.
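For orientation, here is a minimal sketch of what a sequence single-index target and the two alignment statistics could look like. All notation below (tokens x_t, positional weights c_t, link function g, teacher direction w*, student pair (w, p)) is assumed for illustration and is not taken from the paper.

```latex
% Hypothetical SSI-style target: L tokens x_1, ..., x_L in R^d pooled
% along a single teacher direction w* with positional weights c_t.
\[
  y \;=\; g\!\left( \frac{1}{L} \sum_{t=1}^{L} c_t \,
      \frac{\langle w^{\star}, x_t \rangle}{\sqrt{d}} \right)
\]
% Two natural sufficient statistics for a student with weights w and
% positional weights p (again, assumed notation):
\[
  m \;=\; \frac{\langle w, w^{\star} \rangle}{d}
  \ \ \text{(semantic alignment)},
  \qquad
  \theta \;=\; \frac{\langle p, c \rangle}{L}
  \ \ \text{(positional alignment)}.
\]
```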

📝 Abstract
We study the dynamics of stochastic gradient descent (SGD) for a class of sequence models termed Sequence Single-Index (SSI) models, where the target depends on a single direction in input space applied to a sequence of tokens. This setting generalizes classical single-index models to the sequential domain, encompassing simplified one-layer attention architectures. We derive a closed-form expression for the population loss in terms of a pair of sufficient statistics capturing semantic and positional alignment, and characterize the induced high-dimensional SGD dynamics for these coordinates. Our analysis reveals two distinct training phases: escape from uninformative initialization and alignment with the target subspace, and demonstrates how the sequence length and positional encoding influence convergence speed and learning trajectories. These results provide a rigorous and interpretable foundation for understanding how sequential structure in data can be beneficial for learning with attention-based models.
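The abstract's two-phase picture is easy to reproduce in the simplest non-sequential special case. Below is a self-contained toy sketch (not the paper's code; the quadratic link, dimension, and learning rate are arbitrary choices) that runs online SGD for a plain single-index model on the sphere and tracks the semantic overlap m = ⟹w, w*⟩/d: it sits on a plateau near the uninformative initialization before rising toward 1.

```python
# Toy illustration only: NOT the paper's code or its SSI model. This runs
# online SGD for a plain single-index model on the sphere and tracks the
# semantic overlap m = <w, w*>/d. With an even link function the drift of
# m vanishes at m = 0, so the trajectory shows a plateau near the
# uninformative initialization ("escape") before rising toward 1
# ("alignment"). Dimension, link, and learning rate are ad hoc choices.
import numpy as np

rng = np.random.default_rng(0)
d, steps, lr = 500, 50_000, 0.1

def g(z):        # even link function (arbitrary choice)
    return z * z

def g_prime(z):
    return 2.0 * z

w_star = rng.normal(size=d)
w_star *= np.sqrt(d) / np.linalg.norm(w_star)   # normalize so ||w*||^2 = d
w = rng.normal(size=d)
w *= np.sqrt(d) / np.linalg.norm(w)             # uninformative init: m ~ d^{-1/2}

overlaps = []
for step in range(steps):
    x = rng.normal(size=d)                       # fresh sample (online SGD)
    field_s = w @ x / np.sqrt(d)                 # student pre-activation
    field_t = w_star @ x / np.sqrt(d)            # teacher pre-activation
    grad = (g(field_s) - g(field_t)) * g_prime(field_s) * x / np.sqrt(d)
    w -= lr * grad
    w *= np.sqrt(d) / np.linalg.norm(w)          # project back onto the sphere
    if step % 1_000 == 0:
        overlaps.append(abs(w @ w_star) / d)     # |m|; sign is symmetric for even g

print(np.round(overlaps, 3))   # expect: near-zero plateau, then a rise toward 1
```

The projection back onto the sphere mirrors the spherical-SGD setting common in single-index analyses and keeps m in [-1, 1]; the even link is chosen so that the initial plateau is clearly visible.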
Problem

Research questions and friction points this paper is trying to address.

Dynamics of SGD in Sequence-Single Index Models
Population loss in terms of semantic and positional alignment
Influence of sequence length and positional encoding on convergence
Innovation

Methods, ideas, or system contributions that make the work stand out.

SGD dynamics in Sequence Single-Index models
Closed-form population loss via alignment statistics
Two-phase training: escape and alignment (see the sketch below)
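To make that last bullet concrete: analyses of this type typically summarize the high-dimensional SGD trajectory by a low-dimensional ODE for the alignment statistics. A schematic version for the semantic overlap m(t), with assumed notation rather than an equation quoted from the paper, reads:

```latex
% Schematic overlap dynamics (illustrative, not quoted from the paper):
% m(t) = <w(t), w*>/d is the semantic alignment, eta the step size.
\[
  \frac{\mathrm{d}m}{\mathrm{d}t}
  \;=\; \eta \,\bigl(1 - m^{2}\bigr)\, \mu(m),
  \qquad
  \mu(m) \sim c\, m^{k-1} \ \text{as } m \to 0,
\]
% where k is the information exponent of the link function: the drift is
% weak near the uninformative initialization m ~ d^{-1/2} (escape phase)
% and strong once m = O(1) (alignment phase), with m -> 1 at convergence.
```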
Authors

Luca Arnaboldi
IdePHICS Laboratory, École Polytechnique FĂ©dĂ©rale de Lausanne (EPFL), CH-1015 Lausanne, Switzerland

Bruno Loureiro
École Normale SupĂ©rieure & CNRS
Machine Learning, Statistical Mechanics, Disordered Systems

Ludovic Stephan
Assistant Professor, ENSAI

Florent Krzakala
École Polytechnique FĂ©dĂ©rale de Lausanne
Statistical Mechanics, Statistics, Machine Learning, Information Theory, Spin Glasses

L. ZdeborovĂĄ
SPOC Laboratory, École Polytechnique FĂ©dĂ©rale de Lausanne (EPFL), CH-1015 Lausanne, Switzerland