🤖 AI Summary
This work addresses a critical gap in the theoretical understanding of Transformers by incorporating numerical analysis into their formal study. While existing theory assumes exact real-number arithmetic, practical implementations rely on finite-precision floating-point operations, which introduce rounding errors. By integrating numerical computation and function approximation theory, this paper formally demonstrates that under finite-precision floating-point arithmetic, Transformers can represent non-permutation-equivariant functions; they can express all permutation-equivariant functions for bounded sequence lengths, but lose this ability once sequences grow sufficiently long. Moreover, any non-trivial additive positional encoding is shown to impair their equivariant representational capacity. The study establishes, for the first time, the minimal equivariant structure of floating-point Transformers and provides matching upper and lower bounds on their expressive power.
📝 Abstract
Studies of the expressive power of transformers show that transformers are permutation equivariant and can approximate all permutation-equivariant continuous functions on a compact domain. However, these results assume real-valued parameters and exact operations, whereas practical implementations on computers use only a finite set of numbers and inexact machine operations with round-off errors. In this work, we investigate the representability of floating-point transformers, which use floating-point parameters and floating-point operations. In contrast to existing results under exact operations, we first show that floating-point transformers can represent a class of non-permutation-equivariant functions even without positional encoding. Furthermore, we prove that floating-point transformers can represent all permutation-equivariant functions when the sequence length is bounded, but cannot when the sequence length is large. We also identify the minimal equivariance structure in floating-point transformers and show that any non-trivial additive positional encoding can harm their representability.
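As a minimal sketch (not from the paper) of why exact-arithmetic equivariance arguments can fail in practice: floating-point addition is not associative, so a reduction over a sequence (e.g. the summation inside attention) can yield different results when the inputs are permuted.

```python
# Floating-point summation is order-dependent: reordering the same three
# values changes the rounded result, so a left-to-right "sum over tokens"
# is not permutation-invariant under IEEE 754 arithmetic.
vals = [1e16, 1.0, -1e16]

left_to_right = (vals[0] + vals[1]) + vals[2]  # 1.0 is absorbed by 1e16 first
permuted      = (vals[0] + vals[2]) + vals[1]  # large terms cancel first

print(left_to_right)  # 0.0
print(permuted)       # 1.0
```

Here `1e16 + 1.0` rounds back to `1e16` because the gap between adjacent doubles at that magnitude exceeds 1, while summing the large terms first cancels them exactly and preserves the `1.0`.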