Tight Sample Complexity of Transformers

📅 2026-06-08
🤖 AI Summary
This work studies the generalization of deep Transformers, including in chain-of-thought learning, and establishes nearly tight bounds on their Vapnik–Chervonenkis (VC) dimension and sample complexity. For a Transformer of depth $L$ with $W$ parameters that maps an input sequence of length $T$ to a single output, it proves a VC-dimension upper bound of $O(LW \log(TW))$ and a nearly matching lower bound of $\Omega(LW \log(TW/L))$. For chain-of-thought learning, teacher forcing learns with sample complexity $O(LW \log((T+T')W))$, while any learning rule that uses chain-of-thought data requires at least $\Omega(LW \log((T+T')W/L))$ examples, where $T'$ is the number of autoregressive steps. These results quantify how depth, parameter count, and sequence length jointly affect generalization.
📝 Abstract
We tightly characterize the VC dimension of depth-$L$ Transformers with a total of $W$ parameters, mapping an input sequence of length $T$ to a single output, establishing an upper bound of $O(L W \log (T W))$ and a nearly matching lower bound of $\Omega(L W \log (T W / L))$. We further tightly characterize the sample complexity of chain-of-thought learning using such a Transformer, showing teacher forcing (i.e., selecting a predictor consistent with the entire chain-of-thought on training data) learns with sample complexity $O\left(L W \log \left(\left(T+T^{\prime}\right) W\right)\right)$ and that any learning rule that uses chain-of-thought data requires at least $\Omega\left(L W \log \left(\left(T+T^{\prime}\right) W / L\right)\right)$ examples, where $T$ is the input length and $T^{\prime}$ is the number of autoregressive steps.
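To get a feel for how close the stated upper and lower bounds are, the expressions inside the $O(\cdot)$ and $\Omega(\cdot)$ can be evaluated numerically. The following is only an illustrative sketch: the function names are hypothetical, the hidden constants are set to 1, and the natural logarithm is assumed, none of which is specified in the abstract. Only the ratio between the two expressions is meaningful, not the absolute values.

```python
import math

# Illustrative only: hidden constants in O(.) and Omega(.) are set to 1 and
# natural log is assumed. Neither choice comes from the paper.

def vc_upper(L, W, T):
    """Expression inside the VC-dimension upper bound: L * W * log(T * W)."""
    return L * W * math.log(T * W)

def vc_lower(L, W, T):
    """Expression inside the VC-dimension lower bound: L * W * log(T * W / L)."""
    return L * W * math.log(T * W / L)

def cot_upper(L, W, T, T_prime):
    """Expression inside the chain-of-thought sample complexity upper bound."""
    return L * W * math.log((T + T_prime) * W)

def cot_lower(L, W, T, T_prime):
    """Expression inside the chain-of-thought sample complexity lower bound."""
    return L * W * math.log((T + T_prime) * W / L)

def gap_ratio(L, W, T):
    """Upper/lower ratio, which simplifies to log(T*W) / log(T*W / L)."""
    return vc_upper(L, W, T) / vc_lower(L, W, T)

if __name__ == "__main__":
    # Hypothetical model sizes chosen purely for illustration.
    for L, W, T in [(1, 10**6, 512), (32, 10**9, 4096), (96, 10**11, 8192)]:
        print(f"L={L}, W={W:.0e}, T={T}: upper/lower = {gap_ratio(L, W, T):.3f}")
```

With these placeholder constants, the two expressions coincide at $L = 1$ and differ only by the factor $\log(TW) / \log(TW/L)$ otherwise, which stays close to 1 whenever $L$ is small relative to $TW$.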
Problem

Research questions and friction points this paper is trying to address.

Transformers
VC dimension
sample complexity
chain-of-thought learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

VC dimension
sample complexity
Transformers
chain-of-thought learning
teacher forcing