🤖 AI Summary
This work addresses the lack of theoretical foundation for the Strong Lottery Ticket Hypothesis (SLTH) in Transformer multi-head attention (MHA). We provide the first theoretical proof that a randomly initialized MHA contains a high-performance subnetwork—termed a “strong lottery ticket”—capable of approximating any target MHA arbitrarily well. Leveraging probabilistic analysis and linear algebra tools, combined with structured subnetwork search, we derive an error bound showing that it suffices for the key-value hidden dimension to scale as $O(d \log(Hd^{3/2}))$. Our theory shows that such a subnetwork achieves high-accuracy approximation without any training, and that its approximation error decays exponentially as the hidden dimension of the source model increases. Furthermore, we extend the result to normalization-free Transformer architectures, revealing a novel pathway to high-performance sparse subnetworks via structural selection alone.
📝 Abstract
The strong lottery ticket hypothesis (SLTH) conjectures that high-performing subnetworks, called strong lottery tickets (SLTs), are hidden in randomly initialized neural networks. Although recent theoretical studies have established the SLTH across various neural architectures, the SLTH for transformer architectures still lacks theoretical understanding. In particular, the current theory of the SLTH does not yet account for the multi-head attention (MHA) mechanism, a core component of transformers. To address this gap, we introduce a theoretical analysis of the existence of SLTs within MHAs. We prove that if a randomly initialized MHA with $H$ heads and input dimension $d$ has a hidden dimension of $O(d \log(Hd^{3/2}))$ for the key and value, then it contains an SLT that approximates an arbitrary MHA with the same input dimension with high probability. Furthermore, by leveraging this theory for MHAs, we extend the SLTH to transformers without normalization layers. We empirically validate our theoretical findings, demonstrating that the approximation error between the SLT within a source model (MHA and transformer) and its target counterpart decreases exponentially as the hidden dimension of the source model increases.
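The intuition behind such existence results can be illustrated with a toy sketch (not the paper's actual proof, which concerns full MHA blocks): SLTH-style arguments typically show that a single target weight can be approximated by selecting a subset of sufficiently many random weights, with the best achievable error shrinking exponentially in the number of candidates. The snippet below, a minimal illustration under these assumed conventions, brute-forces the best subset-sum approximation of a scalar target from a growing pool of random weights.

```python
import itertools
import random

def best_subset_sum_error(target, weights):
    """Smallest |target - sum(S)| over all subsets S of `weights`
    (the empty subset, i.e. pruning everything, is allowed)."""
    best = abs(target)
    for r in range(1, len(weights) + 1):
        for subset in itertools.combinations(weights, r):
            best = min(best, abs(target - sum(subset)))
    return best

random.seed(0)
target = 0.37            # hypothetical target weight to approximate
pool = []                # random "source" weights; grows like a wider hidden dim
errors = []
for n in (4, 8, 12):
    while len(pool) < n:
        pool.append(random.uniform(-1.0, 1.0))
    errors.append(best_subset_sum_error(target, pool))
print(errors)
```

Because each larger pool contains the smaller one, the best error can only decrease as the pool grows, mirroring (in a very simplified setting) how the approximation error of the SLT shrinks as the source model's hidden dimension increases.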