The Expressive Power of Low Precision Softmax Transformers with (Summarized) Chain-of-Thought

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing theoretical analyses of transformer expressivity mostly rely on hardmax attention, high-precision activations, or other architectural modifications, so they do not describe the low-precision softmax models used in practice. This work shows that standard softmax transformer decoders with rounded activations and attention weights can simulate Turing machines using chain-of-thought (CoT), provided depth and width grow logarithmically with the context length. The proof first constructs a hardmax transformer with ternary activations and well-separated attention scores that simulates a Turing machine, then converts it into an equivalent low-precision softmax transformer without requiring unrealistic parameter magnitudes or activation precision. The same technique is applied to a recently proposed summarized CoT paradigm, for which model size scales logarithmically in a space bound rather than a time bound. Experiments on a Sudoku reasoning task show that these predictions align better with learnability than those of prior high-precision results.

📝 Abstract

Existing expressivity results for transformers typically rely on hardmax attention, high precision, and other architectural modifications that disconnect them from the models used in practice. We bridge this gap by analyzing standard transformer decoders with softmax attention and rounding of activations and attention weights, while allowing depth and width to grow logarithmically with the context length. As an intermediate step, we construct hardmax transformers with ternary activations and well-separated attention scores that simulate Turing machines using Chain-of-Thought (CoT). This lets us convert the constructions to equivalent softmax transformers without the unrealistic parameter magnitudes or activation precision that prior approaches would require. Using the same technique, we analyze a recently proposed summarized CoT paradigm and show that it simulates Turing machines more efficiently, with model size scaling logarithmically in a space bound rather than a time bound. We empirically test predictions made by our results on a Sudoku reasoning task and find better alignment with learnability than for prior high-precision results. Our code is available at https://github.com/moritzbroe/transformer-expressivity.
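The role of well-separated attention scores can be illustrated with a small numerical sketch. It is not the paper's construction: the rounding scheme (round to nearest with a fixed number of fractional bits), the helper names, and the gap bound in `sufficient_gap` are all assumptions made for illustration. The sketch shows that when the largest score exceeds every other score by a gap that grows like the logarithm of the context length, rounding the softmax weights reproduces hardmax attention exactly.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of attention scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def round_to_bits(x, bits):
    # Assumed rounding model: nearest multiple of 2**-bits.
    scale = 2 ** bits
    return math.floor(x * scale + 0.5) / scale

def rounded_softmax(scores, bits):
    # Softmax followed by rounding of the attention weights.
    return [round_to_bits(w, bits) for w in softmax(scores)]

def hardmax(scores):
    # Uniform weight on the maximal positions, zero elsewhere.
    m = max(scores)
    winners = [1.0 if s == m else 0.0 for s in scores]
    k = sum(winners)
    return [w / k for w in winners]

def sufficient_gap(n, bits):
    # Illustrative bound (an assumption of this sketch, not from the paper):
    # with a unique maximum and n positions, any gap strictly above
    # ln(n - 1) + (bits + 1) * ln(2) gives (n - 1) * exp(-gap) < 2**-(bits + 1),
    # so every non-maximal weight rounds to 0 and the maximal weight rounds to 1.
    return math.log(max(n - 1, 1)) + (bits + 1) * math.log(2)

n, bits = 64, 4
gap = sufficient_gap(n, bits) + 0.01
separated = [0.0] * (n - 1) + [gap]   # one score clearly above the rest
crowded = [0.0] * (n - 1) + [1.0]     # gap too small for this context length

matches_hardmax = rounded_softmax(separated, bits) == hardmax(separated)
fails_when_crowded = rounded_softmax(crowded, bits) != hardmax(crowded)
```

Under these assumptions the required gap grows only as the logarithm of `n`, which is one way to see why logarithmic growth in the context length can be enough to keep parameter magnitudes moderate.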

Problem

Research questions and friction points this paper is trying to address.

expressivity

low precision

softmax transformers

Chain-of-Thought

Turing machines

Innovation

Methods, ideas, or system contributions that make the work stand out.

low-precision transformers

softmax attention

Chain-of-Thought