Pause Tokens Strictly Increase the Expressivity of Constant-Depth Transformers

πŸ“… 2025-05-27
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Prior empirical work observed that β€œpause tokens” (e.g., β€œ...”) improve the performance of shallow Transformers, yet their theoretical role remains unexplained. Method: We formalize this phenomenon using circuit complexity theory, modeling bounded- or logarithmic-precision activations, causal masking, and polynomially many pause-token insertions. Contribution/Results: We prove that constant-depth, logarithmic-width Transformers with pause tokens gain strictly greater computational power: they upgrade from computing only a proper subset of $\mathsf{AC}^0$ to computing all of $\mathsf{AC}^0$ (under bounded precision) or $\mathsf{TC}^0$ (under logarithmic precision)β€”the first rigorous complexity-theoretic separation for Transformers with pause tokens. Empirical validation via parity learning confirms this: a two-layer Transformer succeeds on the parity function only when pause tokens are present; it fails completely without them. Our analysis reveals how pause tokens alleviate expressivity bottlenecks under depth constraints and elucidates their synergistic interplay with width and numerical precision.

πŸ“ Abstract
Pause tokens, simple filler symbols such as "...", consistently improve Transformer performance on both language and mathematical tasks, yet their theoretical effect remains unexplained. We provide the first formal separation result, proving that adding pause tokens to constant-depth, logarithmic-width Transformers strictly increases their computational expressivity. With bounded-precision activations, Transformers without pause tokens compute only a strict subset of $\mathsf{AC}^0$ functions, while adding a polynomial number of pause tokens allows them to express the entire class. For logarithmic-precision Transformers, we show that adding pause tokens achieves expressivity equivalent to $\mathsf{TC}^0$, matching known upper bounds. Empirically, we demonstrate that two-layer causally masked Transformers can learn parity when supplied with pause tokens, a function that they appear unable to learn without them. Our results provide a rigorous theoretical explanation for prior empirical findings, clarify how pause tokens interact with width, depth, and numeric precision, and position them as a distinct mechanism, complementary to chain-of-thought prompting, for enhancing Transformer reasoning.
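The experimental setup described above can be illustrated with a minimal sketch. The snippet below shows the parity target function and how pause tokens might be appended to an input sequence before feeding it to a Transformer; the token ids and helper names are illustrative assumptions, not the authors' actual code.

```python
# Hypothetical sketch of the parity task with pause-token augmentation.
# PAUSE is an assumed token id for the filler "..." symbol.
PAUSE = 2

def parity(bits):
    """Target function: 1 if the number of ones is odd, else 0."""
    return sum(bits) % 2

def with_pauses(bits, n_pauses):
    """Append n_pauses pause tokens after the input bits, giving a
    constant-depth Transformer extra positions to compute over."""
    return list(bits) + [PAUSE] * n_pauses

x = [1, 0, 1, 1]
print(parity(x))          # 1 (three ones, odd)
print(with_pauses(x, 3))  # [1, 0, 1, 1, 2, 2, 2]
```

In the paper's regime, the number of appended pause tokens is polynomial in the input length; the model is trained to predict the parity label from the final position of the augmented sequence.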
Problem

Research questions and friction points this paper is trying to address.

The empirical benefit of pause tokens for shallow Transformers has lacked a theoretical explanation
Formally proving that pause tokens increase the computational expressivity of constant-depth Transformers
Showing that pause tokens enable Transformers to learn functions, such as parity, that are otherwise unattainable
Innovation

Methods, ideas, or system contributions that make the work stand out.

First rigorous complexity-theoretic separation showing pause tokens strictly increase Transformer expressivity
Pause tokens lift bounded-precision Transformers from a strict subset of $\mathsf{AC}^0$ to all of $\mathsf{AC}^0$, and logarithmic-precision Transformers to $\mathsf{TC}^0$
Empirical demonstration that two-layer Transformers learn parity only when pause tokens are supplied