AI Summary
This paper investigates the asymptotic dynamics of stochastic gradient descent (SGD) in the sequence single-index (SSI) model and a simplified single-layer attention network, aiming to elucidate how sequential structure, particularly positional encoding and semantic alignment, enhances attention-based learning. Methodologically, the authors derive the first closed-form expression for the population loss of the SSI model, introduce a pair of sufficient statistics that jointly capture semantic and positional information, and combine high-dimensional stochastic optimization theory with asymptotic statistical analysis to obtain an analytical characterization of SGD trajectories. The key contributions are threefold: (1) identifying a two-phase convergence mechanism, an initial escape from uninformative initialization followed by alignment onto the target subspace; (2) quantifying how sequence length and positional encoding accelerate convergence; (3) establishing the first rigorous, interpretable theoretical foundation for attention's efficacy, proving that sequential structure significantly improves learning efficiency.
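To illustrate what "a closed form in a pair of sufficient statistics" can look like, here is a schematic, assuming i.i.d. Gaussian tokens and a toy SSI predictor $f_{w,p}(X) = \sigma(p^\top X w / \sqrt{d})$ with semantic direction $w$ and positional weights $p$ (an assumed form for illustration, not necessarily the paper's exact model): by rotational invariance, the population loss depends on $(w, p)$ only through a few scalar overlaps,

$$
\mathcal{L}(w,p) \;=\; \mathbb{E}_X\Big[\big(f_{w,p}(X) - f_{w^*,p^*}(X)\big)^2\Big]
\;=\; \Psi\big(\langle w, w^*\rangle\,\langle p, p^*\rangle,\; \|w\|\,\|p\|,\; \|w^*\|\,\|p^*\|\big),
$$

where the first argument couples semantic and positional alignment. Tracking such overlaps reduces the $d$-dimensional SGD dynamics to a low-dimensional system, which is what makes the analytical characterization tractable.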
Abstract
We study the dynamics of stochastic gradient descent (SGD) for a class of sequence models termed Sequence Single-Index (SSI) models, where the target depends on a single direction in input space applied to a sequence of tokens. This setting generalizes classical single-index models to the sequential domain, encompassing simplified one-layer attention architectures. We derive a closed-form expression for the population loss in terms of a pair of sufficient statistics capturing semantic and positional alignment, and characterize the induced high-dimensional SGD dynamics for these coordinates. Our analysis reveals two distinct training phases, escape from uninformative initialization followed by alignment with the target subspace, and demonstrates how sequence length and positional encoding influence convergence speed and learning trajectories. These results provide a rigorous and interpretable foundation for understanding how sequential structure in data can be beneficial for learning with attention-based models.
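To make the dynamics concrete, below is a minimal, hypothetical simulation (a sketch under assumed choices, not the paper's exact model, scalings, or step-size schedule): online SGD on a toy SSI teacher of the form y = tanh(p* . (X w*) / sqrt(d)), tracking the semantic overlap <w, w*> and positional overlap <p, p*> over training. All names and parameter values here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

d, L = 200, 8               # token dimension and sequence length (toy choices)
n_steps, lr = 20000, 0.5    # SGD steps and learning rate (illustrative)

# Hypothetical SSI teacher: y = tanh(p_star . (X w_star) / sqrt(d)), with a
# unit-norm semantic direction w_star and positional weights p_star.
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)
p_star = rng.standard_normal(L)
p_star /= np.linalg.norm(p_star)

# Student with the same parametrization, near-uninformative initialization.
w = rng.standard_normal(d) / np.sqrt(d)
p = np.ones(L) / np.sqrt(L)

for step in range(n_steps):
    X = rng.standard_normal((L, d))         # fresh token sequence (online SGD)
    z_star = X @ w_star / np.sqrt(d)
    z = X @ w / np.sqrt(d)
    y = np.tanh(p_star @ z_star)            # teacher output
    y_hat = np.tanh(p @ z)                  # student output
    g = (y_hat - y) * (1.0 - y_hat**2)      # residual times tanh'(pre-activation)
    w -= lr * g * (X.T @ p) / np.sqrt(d)    # SGD step on the semantic direction
    p -= lr * g * z                         # SGD step on the positional weights
    if step % 4000 == 0:
        # The two sufficient statistics: semantic and positional alignment.
        m_w = w @ w_star / np.linalg.norm(w)
        m_p = p @ p_star / np.linalg.norm(p)
        print(f"step {step:6d}  semantic {m_w:+.3f}  positional {m_p:+.3f}")
```

Plotting the two overlaps over training in a setup like this typically exhibits the two phases described above: a plateau near zero while SGD escapes the uninformative initialization, followed by rapid alignment with the target subspace.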