On the Existence of Universal Simulators of Attention

📅 2025-06-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether the Transformer architecture can **exactly reproduce any attention mechanism**—including its output, underlying matrix operations, and activation functions—**without training and in a data-agnostic manner**. To this end, we construct a **universal attention simulator** $\mathcal{U}$, implemented solely with standard Transformer encoder layers, grounded in the RASP formalism. Through purely algorithmic, layer-by-layer design, $\mathcal{U}$ replicates the core computational steps of arbitrary attention mechanisms. We provide a rigorous theoretical proof that the Transformer encoder possesses sufficient expressive power to realize **strictly equivalent simulation** of any attention mechanism. This result establishes, for the first time, the **intrinsic completeness** of Transformers for attention modeling—breaking from prior paradigms reliant on parameter learning and approximation. Our work thus provides a foundational, computation-theoretic perspective on the representational capabilities of attention mechanisms within the Transformer framework.
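To make concrete what the simulator must reproduce, the sketch below implements the vanilla scaled dot-product attention that $\mathcal{U}$ is constructed to replicate exactly. The shapes and toy inputs are illustrative assumptions, not drawn from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Vanilla attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # pairwise query-key similarities
    weights = softmax(scores, axis=-1)  # each row is a distribution over keys
    return weights @ V, weights

# Toy inputs: 4 positions, model dimension 8 (arbitrary choices).
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out, weights = attention(Q, K, V)
```

The paper's contribution is that each of these steps (matrix products, scaling, softmax) can be realized identically by standard encoder layers, rather than approximated through training.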

📝 Abstract
Prior work on the learnability of transformers has established their capacity to approximate specific algorithmic patterns through training under restrictive architectural assumptions. Fundamentally, these arguments remain data-driven and therefore can only provide a probabilistic guarantee. Expressivity, on the contrary, has theoretically been explored to address the problems \emph{computable} by such architectures. These results proved the Turing-completeness of transformers and investigated bounds grounded in circuit complexity and formal logic. Being at the crossroads between learnability and expressivity, the question remains: \emph{can transformer architectures exactly simulate an arbitrary attention mechanism, or in particular, the underlying operations?} In this study, we investigate the transformer encoder's ability to simulate a vanilla attention mechanism. By constructing a universal simulator $\mathcal{U}$ composed of transformer encoders, we present algorithmic solutions to identically replicate attention outputs and the underlying elementary matrix and activation operations via RASP, a formal framework for transformer computation. Our proofs show, for the first time, the existence of an algorithmically achievable, data-agnostic solution, previously known only to be approximated by learning.
Problem

Research questions and friction points this paper is trying to address.

Can transformers exactly simulate arbitrary attention mechanisms?
Investigating transformer encoders' ability to simulate attention.
Proving algorithmic achievability of data-agnostic attention simulation.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer encoder simulates vanilla attention
Universal simulator replicates attention outputs
Algorithmic solution via RASP framework
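The RASP formalism named above expresses transformer computation through two core primitives: `select`, which builds a boolean attention pattern from a predicate, and `aggregate`, which averages the selected values. The following is a toy NumPy rendering of those primitives for illustration only; it is not the authors' construction, and the example predicate is an arbitrary choice:

```python
import numpy as np

def select(keys, queries, predicate):
    # RASP select: boolean selector matrix where entry (i, j) is True
    # when predicate(keys[j], queries[i]) holds.
    return np.array([[predicate(k, q) for k in keys] for q in queries])

def aggregate(selector, values, default=0.0):
    # RASP aggregate: uniformly average the selected values in each row
    # (averaging attention); rows that select nothing yield `default`.
    values = np.asarray(values, dtype=float)
    out = np.full(selector.shape[0], default, dtype=float)
    for i, row in enumerate(selector):
        if row.any():
            out[i] = values[row].mean()
    return out

tokens = [3, 1, 4, 1, 5]
# Each position attends to all positions holding a strictly smaller value.
sel = select(tokens, tokens, lambda k, q: k < q)
smaller_mean = aggregate(sel, tokens)
```

Programs written against these primitives compile to attention patterns, which is what lets the paper argue layer-by-layer that encoder stacks can realize each elementary operation of attention exactly.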