🤖 AI Summary
This work resolves the open problem of whether softmax-attention Chain-of-Thought (CoT) Transformers are Turing-complete, proving the stronger result that even length-generalizable softmax CoT Transformers are Turing-complete. The proof proceeds through the CoT extension of the Counting RASP (C-RASP), a formal model that corresponds to softmax CoT Transformers admitting length generalization. With causal masking, CoT C-RASP is shown to be Turing-complete over a unary alphabet (and, more generally, for letter-bounded languages), but not for arbitrary languages; augmenting it with relative positional encoding restores Turing completeness for arbitrary languages. Whereas hard-attention CoT Transformers were already known to be Turing-complete, these results give the first such characterization for the softmax (soft-attention) setting. The theory is validated empirically by training Transformers on languages that require complex, non-linear arithmetic reasoning.
📝 Abstract
Hard-attention Chain-of-Thought (CoT) transformers are known to be Turing-complete. However, it has remained an open problem whether softmax-attention CoT transformers are Turing-complete. In this paper, we prove the stronger result that length-generalizable softmax CoT transformers are Turing-complete. More precisely, our Turing-completeness proof goes via the CoT extension of the Counting RASP (C-RASP), which corresponds to softmax CoT transformers that admit length generalization. We prove Turing-completeness for CoT C-RASP with causal masking over a unary alphabet (and, more generally, for letter-bounded languages). While we show that this model is not Turing-complete for arbitrary languages, we prove that its extension with relative positional encoding is Turing-complete for arbitrary languages. We empirically validate our theory by training transformers on languages requiring complex (non-linear) arithmetic reasoning.
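To make the counting primitive behind C-RASP concrete, here is a minimal NumPy sketch (an illustration, not code from the paper): a causal softmax attention head whose query-key scores are all equal attends uniformly over the prefix, so its output at each position is the running mean of the value stream. If the values indicate occurrences of a symbol, the head outputs the proportion of that symbol in the prefix, which is the kind of counting quantity C-RASP reasons about.

```python
import numpy as np

def counting_head(values):
    """Causal softmax attention with constant scores.

    When every query-key score is equal, softmax over the causal
    prefix is uniform, so the head returns the running mean of
    `values`. If `values` is the indicator of a symbol, position i
    yields count(symbol in prefix) / (i + 1).
    """
    n = len(values)
    out = np.empty(n)
    for i in range(n):
        scores = np.zeros(i + 1)                       # constant scores
        weights = np.exp(scores) / np.exp(scores).sum()  # uniform softmax
        out[i] = weights @ values[: i + 1]             # running mean
    return out

# Indicator of symbol 'a' in the string "abaab".
vals = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
print(counting_head(vals))  # running proportion of 'a': [1.0, 0.5, 2/3, 0.75, 0.6]
```

A downstream layer can compare such ratios (e.g., test whether the count of `a` exceeds the count of `b` so far), which is how counting predicates are expressed without hard attention.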