🤖 AI Summary
A fundamental expressivity-trainability trade-off exists between attention mechanisms and state-space models (SSMs) in sequence modeling, yet no unified theoretical framework characterizes it. Method: we propose a unified modeling paradigm based on input-dependent interaction operators, integrating structured dynamical-systems modeling, operator spectral analysis, and gradient-flow path theory to establish the first formal analytical framework. Contributions: (1) We introduce the “interaction rank gap” theory, exposing an intrinsic tension between expressive dimensionality and long-range gradient propagation. (2) We prove that single-head attention cannot represent certain structured dynamical maps, and that a linear SSM whose lag operators span a $k$-dimensional subspace is representable by attention if and only if $k$ heads are used (the head-count equivalence theorem). (3) We show that attention admits distance-invariant gradient paths, whereas stable linear SSMs suffer exponential gradient decay with distance (the gradient-highway result).
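The gradient-highway contrast in (3) can be illustrated with a minimal NumPy sketch. It compares the gradient path from position 1 to position $n$ in a stable linear recurrence (which passes through $A^{n-1}$ and shrinks like $\rho^{n-1}$) against a single attention layer (where the path is just an attention weight, set by content rather than distance). The specific matrices, sequence length, and score values are illustrative assumptions, not the paper's constructions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64  # sequence length

# --- Stable linear SSM: h_t = A h_{t-1} + B x_t, so d h_n / d x_1 goes
# through A^{n-1}. Stability forces spectral radius rho < 1, hence the
# path magnitude shrinks like rho^(n-1) with distance.
rho = 0.9
A = rho * np.eye(2)  # toy stable dynamics (assumed diagonal for clarity)
ssm_path = np.linalg.norm(np.linalg.matrix_power(A, n - 1))

# --- Attention: out_n = sum_j alpha_j v_j, so d out_n / d v_1 = alpha_1,
# a content-dependent weight with no explicit dependence on distance n - 1.
scores = rng.normal(size=n)
scores[0] = 5.0  # an input on which position 1 is highly relevant
alpha = np.exp(scores) / np.exp(scores).sum()
attn_path = alpha[0]

print(f"SSM gradient path ~ rho^(n-1): {ssm_path:.2e}")
print(f"attention gradient path (alpha_1): {attn_path:.3f}")
```

For these toy values the recurrent path is already attenuated by several orders of magnitude at distance 63, while the attention path stays $O(1)$ whenever the scores make position 1 relevant.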
📝 Abstract
Sequence modeling has produced diverse architectures -- from classical recurrent neural networks to modern Transformers and state space models (SSMs) -- yet a unified theoretical understanding of expressivity and trainability trade-offs remains limited. We introduce a unified framework that represents a broad class of sequence maps via an input-dependent effective interaction operator $W_{ij}(X)$, making explicit two recurring construction patterns: (i) the Unified Factorized Framework (Explicit; attention-style mixing), in which $W_{ij}(X)$ varies through scalar coefficients applied to shared value maps, and (ii) Structured Dynamics (Implicit; state-space recurrences), in which $W_{ij}$ is induced by a latent dynamical system. Using this framework, we derive three theoretical results. First, we establish the Interaction Rank Gap: models in the Unified Factorized Framework, such as single-head attention, are constrained to a low-dimensional operator span and cannot represent certain structured dynamical maps. Second, we prove an Equivalence (Head-Count) Theorem showing that, within our multi-head factorized class, representing a linear SSM whose lag operators span a $k$-dimensional subspace on length-$n$ sequences requires and is achievable with $H=k$ heads. Third, we prove a Gradient Highway Result, showing that attention layers admit inputs with distance-independent gradient paths, whereas stable linear dynamics exhibit distance-dependent gradient attenuation. Together, these results formalize a fundamental trade-off between algebraic expressivity (interaction/operator span) and long-range gradient propagation, providing theoretical grounding for modern sequence architecture design.
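The two construction patterns can be made concrete with a small NumPy sketch of the operator span behind the Interaction Rank Gap: for a single attention head, every effective operator $W_{ij}(X) = \alpha_{ij}(X)\,W_V$ is a scalar multiple of one shared value map, while an SSM's lag operators $W_{ij} = C A^{i-j-1} B$ can span a higher-dimensional subspace. The dimensions, random parameters, and placeholder coefficients below are illustrative assumptions, not the paper's constructions.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3  # model / state dimension
n = 6  # sequence length

# --- Explicit (attention-style): y_i = sum_j alpha_ij(X) W_V x_j, so every
# effective operator W_ij(X) = alpha_ij * W_V lies in span{W_V}.
W_V = rng.normal(size=(d, d))
alpha = rng.random(size=(n, n))  # placeholder content-dependent coefficients
attn_ops = [alpha[i, j] * W_V for i in range(n) for j in range(i + 1)]
attn_rank = np.linalg.matrix_rank(np.stack([op.ravel() for op in attn_ops]))

# --- Implicit (SSM): h_t = A h_{t-1} + B x_t, y_t = C h_t, so the lag-m
# operator is C A^m B; these can span a multi-dimensional operator subspace.
A, B, C = (rng.normal(size=(d, d)) for _ in range(3))
ssm_ops = [C @ np.linalg.matrix_power(A, m) @ B for m in range(n)]
ssm_rank = np.linalg.matrix_rank(np.stack([op.ravel() for op in ssm_ops]))

print(f"single-head attention operator span: {attn_rank}")  # 1
print(f"SSM lag-operator span: {ssm_rank}")  # > 1 generically
```

Flattening each operator and taking a matrix rank measures the dimension of the spanned operator subspace: the single head is pinned to one dimension, while a generic $d$-dimensional SSM reaches $d$ (Cayley-Hamilton caps the span of $C A^m B$ at the state dimension), which is the gap the first result formalizes.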