🤖 AI Summary
This work investigates the function approximation capability of Transformers for noisy linear dynamical systems under in-context learning. How does Transformer depth affect a Transformer's ability to learn and generalize over sequential, dynamically structured data? Method: We combine tools from approximation theory and linear system identification with a comparative analysis against least-squares estimation. Contribution/Results: (1) We establish a depth separation phenomenon: multi-layer Transformers achieve an $L^2$ approximation error bound matching that of least-squares estimators using only logarithmic depth; (2) we prove that single-layer linear Transformers attain near-optimal convergence under i.i.d. data but suffer a fundamental, non-vanishing error lower bound under non-i.i.d. (e.g., temporally dependent) settings, revealing an intrinsic limitation in modeling dynamic structure. This is the first theoretical characterization quantifying the relationship between Transformer depth and in-context learning capacity for linear dynamical systems, offering new insights into the temporal generalization mechanisms of large language models.
📝 Abstract
This paper investigates approximation-theoretic aspects of the in-context learning capability of transformers in representing a family of noisy linear dynamical systems. Our first theoretical result establishes an upper bound on the approximation error of multi-layer transformers with respect to an $L^2$-testing loss defined uniformly across tasks. This result demonstrates that transformers with logarithmic depth can achieve error bounds comparable to those of the least-squares estimator. In contrast, our second result establishes a non-diminishing lower bound on the approximation error for a class of single-layer linear transformers, which suggests a depth-separation phenomenon for transformers in the in-context learning of dynamical systems. Moreover, this second result uncovers a critical distinction in the approximation power of single-layer linear transformers when learning from IID versus non-IID data.
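To make the least-squares baseline concrete, here is a minimal sketch (not from the paper; the system matrix, noise level, and trajectory length are illustrative assumptions) of the classical estimator the transformer is compared against: identifying the state-transition matrix $A$ of a noisy linear dynamical system $x_{t+1} = A x_t + w_t$ from a single observed trajectory.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example system: x_{t+1} = A x_t + w_t with Gaussian noise w_t.
# The dimension d, trajectory length T, and noise scale are illustrative choices.
d, T = 3, 500
A = 0.9 * np.eye(d) + 0.05 * rng.standard_normal((d, d))  # a (likely) stable system

x = np.zeros((T + 1, d))
for t in range(T):
    x[t + 1] = A @ x[t] + 0.1 * rng.standard_normal(d)  # noisy state transition

# Least-squares estimator: A_hat = argmin_A sum_t ||x_{t+1} - A x_t||^2.
# Row-wise, x[t+1] = x[t] @ A.T, so lstsq(X_prev, X_next) recovers A.T.
X_prev, X_next = x[:-1], x[1:]
M, *_ = np.linalg.lstsq(X_prev, X_next, rcond=None)
A_hat = M.T

rel_err = np.linalg.norm(A_hat - A) / np.linalg.norm(A)
print(f"relative estimation error: {rel_err:.3f}")
```

The paper's first result says that a transformer of only logarithmic depth can match the error rate of this estimator in-context, i.e., without being told $A$ or explicitly solving the regression above.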