🤖 AI Summary
This work investigates the function approximation capability of Transformers for noisy linear dynamical systems under in-context learning. How does Transformer depth affect a Transformer's ability to learn and generalize over sequential, dynamically structured data? Method: We combine tools from approximation theory and linear system identification with a comparative analysis against least-squares estimation. Contribution/Results: (1) We establish a depth separation phenomenon: multi-layer Transformers achieve an $L^2$ approximation error bound matching that of least-squares estimators using only logarithmic depth; (2) we prove that single-layer linear Transformers attain near-optimal convergence under i.i.d. data but suffer a fundamental, non-vanishing error lower bound under non-i.i.d. (e.g., temporally dependent) settings, revealing an intrinsic limitation in modeling dynamic structure. This is the first theoretical characterization quantifying the relationship between Transformer depth and in-context learning capacity for linear dynamical systems, offering new insights into the temporal generalization mechanisms of large language models.
📝 Abstract
This paper investigates approximation-theoretic aspects of the in-context learning capability of transformers in representing a family of noisy linear dynamical systems. Our first theoretical result establishes an upper bound on the approximation error of multi-layer transformers with respect to an $L^2$-testing loss defined uniformly across tasks. This result demonstrates that transformers with logarithmic depth can achieve error bounds comparable to those of the least-squares estimator. In contrast, our second result establishes a non-diminishing lower bound on the approximation error for a class of single-layer linear transformers, which suggests a depth-separation phenomenon for transformers in the in-context learning of dynamical systems. Moreover, this second result uncovers a critical distinction in the approximation power of single-layer linear transformers when learning from IID versus non-IID data.
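To make the least-squares baseline concrete, here is a minimal sketch (not from the paper; the system matrix, noise level, and trajectory length are illustrative assumptions) of the classical estimator the transformer is compared against: identifying the state-transition matrix $A$ of a noisy linear dynamical system $x_{t+1} = A x_t + w_t$ from a single observed trajectory.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example system: x_{t+1} = A x_t + w_t with Gaussian noise w_t.
# The dimension d, trajectory length T, and noise scale are illustrative choices.
d, T = 3, 500
A = 0.9 * np.eye(d) + 0.05 * rng.standard_normal((d, d))  # a (likely) stable system

x = np.zeros((T + 1, d))
for t in range(T):
    x[t + 1] = A @ x[t] + 0.1 * rng.standard_normal(d)  # noisy state transition

# Least-squares estimator: A_hat = argmin_A sum_t ||x_{t+1} - A x_t||^2.
# Row-wise, x[t+1] = x[t] @ A.T, so lstsq(X_prev, X_next) recovers A.T.
X_prev, X_next = x[:-1], x[1:]
M, *_ = np.linalg.lstsq(X_prev, X_next, rcond=None)
A_hat = M.T

rel_err = np.linalg.norm(A_hat - A) / np.linalg.norm(A)
print(f"relative estimation error: {rel_err:.3f}")
```

The paper's first result says that a transformer of only logarithmic depth can match the error rate of this estimator in-context, i.e., without being told $A$ or explicitly solving the regression above.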