🤖 AI Summary
This work addresses the ambiguity in existing characterizations of Transformer expressivity, which lack exact results and are sensitive to modeling choices that obscure the true impact of architectural decisions. To analyze the roles of attention type, width, depth, and numerical precision, the paper studies idealized Transformers whose input is padded with filler symbols and connects them to Boolean circuit complexity classes. The key finding is that numerical precision and depth are the primary determinants of expressive power, whereas width and attention type have little influence. Within a unified framework, the study establishes equivalences between polynomially padded Transformers and uniform circuit classes: constant-precision Transformers correspond to L-uniform AC⁰, while growing-precision ones correspond to L-uniform TC⁰. With logᵈN rounds of looping, these models reach FO-uniform ACᵈ and TCᵈ, respectively. All results hold for both softmax and averaged hard attention.
📝 Abstract
Recent work describes what transformers can and cannot compute through connections to Boolean circuits, but existing results lack exact characterizations and are sensitive to modeling choices. Padded transformers -- to whose input filler symbols such as ``...'' are appended -- emerge as a useful gadget for establishing equivalences to circuit classes by providing polynomial space for adaptive parallel computation. However, only a limited set of padded transformer idealizations has been studied, leaving open how robustly these equivalences hold under changes to attention type, model width, and uniformity. We find that, under practical assumptions, padded transformers are surprisingly robust to all of these, and identify numeric precision and model depth as the main factors affecting expressivity. Concretely, we prove that polynomially padded $\text{L-uniform}$ constant-precision transformers are equivalent to $\text{L-uniform AC}^0$, while growing-precision ones achieve $\text{L-uniform TC}^0$ regardless of width. Furthermore, looping enables sequential processing analogous to circuits: $\log^d N$-looped constant-precision transformers reach $\text{FO-uniform AC}^d$, and growing-precision ones reach $\text{FO-uniform TC}^d$. Interestingly, growing width or precision beyond logarithmic does not increase expressivity, and all our results hold for both softmax and average hard attention transformers.
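To make two of the abstract's ingredients concrete, the following is a minimal illustrative sketch, not the paper's construction. It shows polynomial padding of an input with filler symbols and a single-query comparison of softmax attention with average hard attention. The function names, the choice of padded length `n ** degree`, and the use of scalar values are all assumptions made for illustration only.

```python
import math


def pad_input(tokens, degree=2, filler="..."):
    """Append filler symbols until the sequence has length n ** degree.

    Assumption: the polynomial n ** degree is only one possible choice of
    padded length; the abstract says no more than "polynomially padded".
    """
    n = len(tokens)
    target = max(n, n ** degree)
    return list(tokens) + [filler] * (target - n)


def softmax_attention(scores, values):
    """Weighted average of values using softmax weights over the scores."""
    m = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def average_hard_attention(scores, values):
    """Uniform average of the values at positions that attain the maximum score."""
    m = max(scores)
    winners = [v for s, v in zip(scores, values) if s == m]
    return sum(winners) / len(winners)
```

In this sketch, average hard attention ignores every position whose score is below the maximum and splits the weight evenly among ties, whereas softmax assigns every position some weight. Scaling the scores by a large factor pushes the softmax output toward the average-hard output, which gives some intuition for why results might carry over between the two attention types; the paper's actual argument is not reproduced here.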